stellaathena | 3 years ago | on: Stable Diffusion launch announcement
stellaathena's comments
stellaathena | 4 years ago | on: Accidentally Turing-Complete
stellaathena | 4 years ago | on: Announcing GPT-NeoX-20B
stellaathena | 4 years ago | on: A Systematic Investigation of Commonsense Understanding in Large Language Models
The paper opens by saying "Large language models (with more than 1 billion parameters) perform well on a range of natural language processing (NLP) tasks in zero- and few-shot settings, without requiring task-specific supervision," citing Radford et al. (2019), Brown et al. (2020), and Patwary et al. (2021) for this sentence. But the first paper doesn't claim zero-shot performance, and the other two sources are about models orders of magnitude larger! I don't see any evidence in this paper to support the idea that there are people going around claiming that 1B+ parameter models have impressive zero-shot performance.
They make a similar error that reinforces this one in their conclusion as well. They say "At first sight, these models show impressive zero-shot performance suggesting that they capture commonsense knowledge," completely ignoring the fact that they never actually showed that. They also never state how large their "SOTA" model is, which seems quite important. They additionally never compare to the models that are actually claimed to have zero-shot performance. Their model performs similarly to GPT-3 13B on HellaSwag, 1.3B on Winogrande, and 6.7B on PiQA. There are clearly large confounding factors they aren't controlling for here.
The fact that ML researchers, and especially NLP researchers, use extremely low-quality baselines is not news. People publish papers pointing this out all the time. The same is true of the fact that many NLP datasets are garbage. "PiQA and HellaSwag are bad evaluation metrics" is potentially a worthwhile contribution to the literature (I haven't checked whether these particular datasets have been critiqued before), but it is something personally known to me and something that no NLP researcher should find surprising. If people are surprised by this, I think those people really need to spend more time reading the literature and evaluating datasets. Your default assumption should be that a benchmark eval is loosely correlated with what it nominally measures. And none of this really has any bearing on zero-shot generalization.
If the primary value of this paper is pointing out that the datasets are bad, that can manifest as misleading zero-shot scores, but it also manifests as misleading few-shot scores and misleading fine-tuned scores. I would expect fine-tuned and few-shot versions of Fig. 6 to look pretty much the same, but we are not shown such plots.
Why? I can't be sure, but the authors take themselves to be specifically criticizing zero-shot claims, and if those plots look as I expect, they would significantly undermine the authors' claims. And even if they don't, these models are not being evaluated in a regime in which people are actually claiming significant zero-shot performance. The paper's entire framing is predicated on the false claim that people are claiming that 1B+ parameter models are zero-shot learners.
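For context on what "zero-shot scores" on these benchmarks actually measure: HellaSwag, PiQA, and Winogrande are multiple-choice tasks, and the standard zero-shot protocol is to score each candidate completion by its log-likelihood under the language model and pick the argmax. Here is a minimal sketch of that protocol. The "model" is a toy unigram distribution standing in for a real LM, so the vocabulary, probabilities, and function names are all illustrative assumptions, not any paper's actual setup.

```python
# Hedged sketch of zero-shot multiple-choice evaluation, as used for
# benchmarks like HellaSwag and PiQA: score each candidate completion
# by its log-likelihood under the LM and pick the highest-scoring one.
# The "model" here is a toy unigram distribution, not a real LM.

TOY_UNIGRAM_LOGPROBS = {
    "the": -1.0, "cat": -2.0, "sat": -2.5,
    "on": -1.5, "mat": -3.0, "refrigerator": -9.0,
}
UNKNOWN_LOGPROB = -12.0  # assumed floor for out-of-vocabulary tokens


def sequence_logprob(text: str) -> float:
    """Sum of per-token log-probabilities under the toy model."""
    return sum(
        TOY_UNIGRAM_LOGPROBS.get(tok, UNKNOWN_LOGPROB)
        for tok in text.lower().split()
    )


def zero_shot_choice(context: str, choices: list[str]) -> int:
    """Return the index of the highest-likelihood completion.

    Real harnesses often also report length-normalized scores,
    since raw log-likelihood penalizes longer answers.
    """
    scores = [sequence_logprob(f"{context} {c}") for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)


# The shared context contributes the same score to every choice,
# so only the completions' likelihoods decide the answer.
pred = zero_shot_choice("the cat sat on the", ["mat", "refrigerator"])
# pred == 0, i.e. "mat"
```

The point of the sketch is that this exact scoring procedure is also what produces few-shot and fine-tuned scores (with demonstrations prepended or weights updated), which is why a dataset artifact that inflates zero-shot numbers should inflate those numbers too.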
stellaathena | 4 years ago | on: T0* – Series of encoder-decoder models trained on a large set of different tasks
"Instruction-tuning" is clearly in the air. Simultaneous work at Google (released less than two weeks ago) on a model they call FLAN can be found here: https://ai.googleblog.com/2021/10/introducing-flan-more-gene...
EleutherAI attempted to do something similar several months ago, but didn't succeed: https://blog.eleuther.ai/tuning-on-eval-harness/
A careful analysis of the similarities and differences between the three approaches would likely be highly beneficial to the community.
stellaathena | 4 years ago | on: How to avoid machine learning pitfalls: a guide for academic researchers
http://proceedings.mlr.press/v137/biderman20a/biderman20a.pd...
stellaathena | 4 years ago | on: EleutherAI One Year Retrospective
stellaathena | 4 years ago | on: GPT-J-6B – A 6 billion parameter, autoregressive text generation model
That said, I wouldn’t feel bad for Ben. The world is his oyster.
stellaathena | 4 years ago | on: LaMDA: Google's New Conversation Technology
stellaathena | 5 years ago | on: The makers of Eleuther hope it will be an open source alternative to GPT-3
Building open source infrastructure is hard. There does not currently exist a comprehensive open source framework for evaluating language models. We are currently working on building one (https://github.com/EleutherAI/lm-evaluation-harness) and are excited to share results when we have the harness built.
If you don’t think the model works, you are welcome to not use it and you are welcome to produce evaluations showing that it doesn’t work. We would happily advertise your eval results side by side with our own.
I am curious where you think we are riding the hype /to/ so to speak. The attention we’ve gotten in the last two weeks has actually been a net negative from a productivity POV, as it’s diverted energy away from our larger modeling work towards bug fixes and usability improvements. We are a dozen or so people hanging out in a discord channel and coding stuff in our free time, so it’s not like we are making money or anything based on this either.
There's a discussion of this in the VQGAN-CLIP paper; see in particular Section 6.1, "Efficiency as a Value": https://arxiv.org/abs/2204.08583
Disclaimer: I'm one of the authors of the VQGAN-CLIP paper and was tangentially involved with Stable Diffusion.