stellaathena's comments

stellaathena | 3 years ago | on: Stable Diffusion launch announcement

Stable Diffusion produces substantially higher quality images in most contexts, but is much more expensive to produce. The genius of VQGAN-CLIP is that it showed you could take two pre-existing models and combine them to get text-to-image synthesis to work at all. By contrast, models like DALL-E and Stable Diffusion require extremely expensive pretraining.

There's a discussion of this in the VQGAN-CLIP paper, see in particular 6.1 "Efficiency as a Value" https://arxiv.org/abs/2204.08583

Disclaimer: I'm one of the authors of the VQGAN-CLIP paper and was tangentially involved with Stable Diffusion.

stellaathena | 4 years ago | on: Accidentally Turing-Complete

I/O is covered by Turing completeness. Whether timing is covered depends significantly on how exactly you want to formalize things, but certainly any TC system has the capacity to keep track of time if its pieces are manipulated at a constant rate.
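To make the timing point concrete, here is a minimal sketch: if every primitive step of a system takes a fixed duration, a plain step counter doubles as a clock. The step duration `STEP_MS` is a hypothetical assumption for illustration, not something any particular system guarantees.

```python
# Assumption: each primitive machine step takes a constant 2 ms.
STEP_MS = 2

def elapsed_ms(steps: int) -> int:
    """Time tracked purely by counting steps at an assumed-constant rate."""
    return steps * STEP_MS

print(elapsed_ms(5000))  # -> 10000 ms of "machine time"
```

The formalization question is exactly whether you are allowed to assume that constant rate; once you are, timekeeping reduces to counting.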

stellaathena | 4 years ago | on: Announcing GPT-NeoX-20B

~40 GB with standard optimization. I suspect you can shrink it down more with some work, but it would require significant innovation to cram it into the next most common GPU memory size down (24 GB, unless I’m misremembering)
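The ~40 GB figure follows from back-of-the-envelope arithmetic: 20 billion parameters at 2 bytes each (fp16/bf16). A rough sketch, ignoring activations and framework overhead:

```python
def model_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Rough weight-only memory footprint; ignores activations,
    optimizer state, and framework overhead."""
    return n_params * bytes_per_param / 1e9

n = 20e9  # GPT-NeoX-20B parameter count
print(model_memory_gb(n, 2))    # fp16/bf16 weights -> 40.0 GB
print(model_memory_gb(n, 1))    # hypothetical int8  -> 20.0 GB
```

Even at 1 byte per parameter the weights alone land at 20 GB, leaving very little of a 24 GB card for activations, which is why fitting the model there takes real work.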

stellaathena | 4 years ago | on: A Systematic Investigation of Commonsense Understanding in Large Language Models

This paper doesn't make a whole lot of sense to me. I'm not familiar with any work that claims meaningful zero-shot performance in models as small as the ones considered in this paper. Quite the opposite, both the GPT-3 and FLAN papers claim that zero-shot behavior doesn't arise until much larger than 7B. The T0 paper is the paper that seems closest to the claims in this paper, but even then they're talking about a multitask trained 11B parameter model, not a 7B "normal" model.

The paper opens by saying "Large language models (with more than 1 billion parameters) perform well on a range of natural language processing (NLP) tasks in zero- and few-shot settings, without requiring task-specific supervision," citing Radford et al. (2019), Brown et al. (2020), and Patwary et al. (2021) for this sentence. But the first paper doesn't claim zero-shot performance and the other two sources are about models orders of magnitude larger! I don't see any evidence in this paper to support the idea that there are people going around claiming that 1B+ parameter models have impressive zero-shot performance.

They make a similar error that reinforces this one in their conclusion as well. They say "At first sight, these models show impressive zero-shot performance suggesting that they capture commonsense knowledge," completely ignoring the fact that they never actually showed that. They also never explain how large their "SOTA" model is, which seems quite important. They additionally never compare to the models that are actually claiming zero-shot performance. Their model performs similarly to GPT-3 13B on HellaSwag, 1.3B on Winogrande, and 6.7B on PiQA. There are clearly some large confounding factors they aren't controlling for here.

The fact that ML researchers, and especially NLP researchers, use extremely low quality baselines is not news. People publish papers pointing this out all the time. The same is true of the fact that many NLP datasets are garbage. "PiQA and HellaSwag are bad evaluation metrics" is potentially a worthwhile addition to the literature (I haven't checked if these particular datasets have been critiqued) but is something personally known to me and something that no NLP researcher should find surprising. If people are surprised by this, I think that those people really need to spend more time reading the literature and evaluating datasets. Your default assumption should be that a benchmark eval is loosely correlated with what it nominally measures. And all of these things really have no bearing on zero-shot generalization.

If the primary value of this paper is pointing out that the datasets are bad, that can manifest as misleading zero-shot scores, but it also manifests as misleading few-shot scores and misleading fine-tuned scores. I would expect a fine-tuned and few-shot version of Fig 6 to look pretty much the same, but we are not shown such plots.

Why? I can't be sure, but the authors take themselves to be specifically criticizing zero-shot claims, and if those plots look as I expect they would significantly undermine the paper's claims. And even if they don't, these models are not being evaluated in a regime in which people actually claim significant zero-shot performance. The paper's entire framing is predicated on the false premise that people claim 1B+ parameter models are zero-shot learners.

stellaathena | 4 years ago | on: T0* – Series of encoder-decoder models trained on a large set of different tasks

Minor correction: I (Stella Biderman) am a contributor to BigBench, have read many of its tasks, and have had access to it for months. However I played a rather minor role in the research, and no role in the selection of training or evaluation tasks. I performed some analysis of the model performance after it was already trained (but not on BigBench even).

stellaathena | 4 years ago | on: T0* – Series of encoder-decoder models trained on a large set of different tasks

[Disclaimer: I am an author of the above paper and played a rather minimal role. I am also a prominent member of EleutherAI.]

"Instruction-tuning" is clearly in the air. Simultaneous work at Google (released less than two weeks ago) on a model they call FLAN can be found here: https://ai.googleblog.com/2021/10/introducing-flan-more-gene...

EleutherAI attempted to do something similar several months ago, but didn't succeed: https://blog.eleuther.ai/tuning-on-eval-harness/

A careful analysis of the similarities and differences between the three approaches would likely be highly beneficial to the community.

stellaathena | 4 years ago | on: How to avoid machine learning pitfalls: a guide for academic researchers

This paper is weirdly similar to one I wrote for NeurIPS's "I Can't Believe It's Not Better!" Workshop. Had I not been constrained by the structure and rules of the workshop, I would have pretty much written this exact paper. The title is even quite close to "Pitfalls in Machine Learning Research: Reexamining the Development Cycle."

http://proceedings.mlr.press/v137/biderman20a/biderman20a.pd...

stellaathena | 4 years ago | on: EleutherAI One Year Retrospective

Turning down hundreds of thousands of dollars of cloud credits because of vendor lock-in is extremely shortsighted for the overwhelming majority of projects.

stellaathena | 4 years ago | on: GPT-J-6B – A 6 billion parameter, autoregressive text generation model

In Google’s defense, it’s not that Ben didn’t go to college; it’s that he’s still a college student. This is less “experienced ML dev iced out over lack of degree” and more “college kid does something amazing and some people aren’t sold on hiring him on the spot.”

That said, I wouldn’t feel bad for Ben. The world is his oyster.

stellaathena | 5 years ago | on: The makers of Eleuther hope it will be an open source alternative to GPT-3

> Their work on GPT-Neo riles me up because they do such a weak job comparing it to the models whose hype they’re riding.

Building open source infrastructure is hard. There does not currently exist a comprehensive open source framework for evaluating language models. We are currently working on building one (https://github.com/EleutherAI/lm-evaluation-harness) and are excited to share results when we have the harness built.

If you don’t think the model works, you are welcome to not use it and you are welcome to produce evaluations showing that it doesn’t work. We would happily advertise your eval results side by side with our own.

I am curious where you think we are riding the hype /to/ so to speak. The attention we’ve gotten in the last two weeks has actually been a net negative from a productivity POV, as it’s diverted energy away from our larger modeling work towards bug fixes and usability improvements. We are a dozen or so people hanging out in a discord channel and coding stuff in our free time, so it’s not like we are making money or anything based on this either.