top | item 30633649


stevenwalton | 4 years ago

> Are transformers competitive with (for example) CNNs on vision-related tasks when there's less data available?

They can be. There's active research into the tradeoff between local inductive bias (information from small, local receptive fields, which CNNs have a strong form of) and global inductive bias (large receptive fields, i.e. attention). Plenty of works combine CNNs and attention/transformers. A handful focus on smaller datasets, but the majority target ImageNet. There's also work on changing the receptive fields within attention mechanisms as a way to balance the two.
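To make the local-vs-global distinction concrete, here's a hypothetical sketch (mine, not from any paper in the thread): single-head self-attention in NumPy, with an optional band mask that restricts each token to a local window, turning attention's global receptive field into a CNN-like local one. The function name and window parameter are illustrative.

```python
import numpy as np

def self_attention(x, w=None):
    """x: (n, d) token matrix. w: window radius; None = global receptive field."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                 # (n, n) pairwise similarities
    if w is not None:                             # local inductive bias: mask out
        idx = np.arange(n)                        # tokens beyond the window
        mask = np.abs(idx[:, None] - idx[None, :]) > w
        scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True) # row-wise softmax
    return weights @ x                            # (n, d) mixed tokens

x = np.random.default_rng(0).standard_normal((8, 4))
global_out = self_attention(x)         # every token attends to all 8 tokens
local_out = self_attention(x, w=1)     # each token attends only to its neighbors
```

The only change between the two calls is the mask, which is why the receptive field of attention is so easy to dial between local and global.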

> Are transformers ever used with that scale of data?

So there's a yes and a no to your question. But definitely yes: people have done work on Flowers102 (6.5k training images) and CIFAR-10 (50k training images). Keep in mind that not all of these models are pure transformers; some have early or intermediate convolutions. Some of these works even have fewer parameters and better computational efficiency than comparable CNNs.
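The "early convolutions" pattern can be sketched in a few lines. This is a toy, hypothetical example (single channel, single filter, names mine): a small strided convolution that turns an image into a short token sequence, which is what hybrid models do before handing off to a transformer encoder.

```python
import numpy as np

def conv_stem(img, kernel, stride=4):
    """img: (H, W), kernel: (k, k). Returns (n_tokens, 1) patch embeddings."""
    k = kernel.shape[0]
    h, w = img.shape
    tokens = []
    for i in range(0, h - k + 1, stride):         # slide the filter with stride
        for j in range(0, w - k + 1, stride):
            tokens.append((img[i:i + k, j:j + k] * kernel).sum())
    return np.array(tokens)[:, None]              # each patch becomes one token

# A 16x16 image with a 4x4 averaging filter and stride 4 -> a 4x4 grid
# of patches, i.e. 16 tokens to feed an attention layer.
tokens = conv_stem(np.ones((16, 16)), np.ones((4, 4)) / 16, stride=4)
```

The point is that the convolution bakes locality and translation structure into the tokens, so the transformer on top needs less data to learn those biases itself.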

But more importantly, I think the big question is what type of data you have. If large receptive fields help your problem, transformers will work great. If you need local receptive fields, CNNs will tend to do better (or combinations of transformers and CNNs, or transformers with reduced receptive fields). I doubt there will be a one-size-fits-all architecture.

One thing to also keep in mind is that transformers typically need heavy augmentation, and not all data can be augmented meaningfully. There are also pre-training and knowledge transfer/distillation to lean on.
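For a feel of what "heavy augmentation" means in practice, here's a minimal hypothetical pipeline in NumPy (flip, random crop, cutout-style erasing); the function names and parameters are illustrative, not from any specific training recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, crop=28, cutout=8):
    """img: (H, W, C) array. Returns an augmented (crop, crop, C) copy."""
    h, w, _ = img.shape
    if rng.random() < 0.5:                        # random horizontal flip
        img = img[:, ::-1]
    top = rng.integers(0, h - crop + 1)           # random crop
    left = rng.integers(0, w - crop + 1)
    img = img[top:top + crop, left:left + crop].copy()
    y = rng.integers(0, crop - cutout + 1)        # cutout: zero a square patch
    x = rng.integers(0, crop - cutout + 1)
    img[y:y + cutout, x:x + cutout] = 0
    return img

# Every epoch the model sees a different random view of each image.
batch = [augment(np.ones((32, 32, 3))) for _ in range(4)]
```

This kind of pipeline works for natural images, but it's exactly the step that breaks down for data (e.g. some medical or scientific signals) where flips and occlusions change the label.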
