item 35504052

Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows

48 points | jonbaer | 3 years ago | arxiv.org | reply

19 comments

[+] nharada|3 years ago|reply
Researchers love the ViT and all its varieties, but a reminder for those following at home that ConvNets still work and scale fine, depending on your requirements. For example, in "A ConvNet for the 2020s"[1], the authors are able to scale up ConvNets to the sizes of ViTs.

From the abstract: "Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation"

[1] https://arxiv.org/abs/2201.03545

[+] neonbjb|3 years ago|reply
You can make almost anything work in DL if you try hard enough; that doesn't mean it is the correct thing to do. Convolutions have inductive biases which are the cause of many of the problems associated with deep learning over the last 10 years. Researchers don't "love the ViT". They use it because it is simply better in every way, in every application.

The only reason convolutions are still used in modern (intelligently designed) ML systems is because it is not known how to build a sparse attention algorithm that achieves 2D and 3D locality and is also compatible with modern accelerators. Swin is an attempt at that, but it is something of a hack.
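The "2D locality that's still accelerator-friendly" tension is easier to see in code. Below is a toy NumPy sketch of window attention plus the shifted variant (the core Swin trick), not the actual Swin implementation: learned Q/K/V projections, multi-head splitting, relative position bias, and the attention mask Swin uses to stop shifted windows from attending across wrapped borders are all omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, window=4):
    """Self-attention restricted to non-overlapping windows.

    x: (H, W, C) feature map; H and W must be divisible by `window`.
    Identity projections stand in for the learned Wq/Wk/Wv here.
    Each window is a dense (window^2 x window^2) attention, which is
    what makes this layout friendly to modern accelerators.
    """
    H, W, C = x.shape
    out = np.zeros_like(x)
    for i in range(0, H, window):
        for j in range(0, W, window):
            w = x[i:i + window, j:j + window].reshape(-1, C)
            attn = softmax(w @ w.T / np.sqrt(C))
            out[i:i + window, j:j + window] = (attn @ w).reshape(window, window, C)
    return out

def shifted_window_attention(x, window=4):
    """Shift features by half a window so information crosses window borders.

    Real Swin masks the wrapped-around tokens after the roll; this sketch
    skips that, so border windows here attend across the wrap.
    """
    s = window // 2
    shifted = np.roll(x, shift=(-s, -s), axis=(0, 1))
    out = window_attention(shifted, window)
    return np.roll(out, shift=(s, s), axis=(0, 1))
```

The locality is strict: perturbing a token in one window leaves every other window's output untouched, which is exactly the bias the shifted layer is there to soften.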

[+] panabee|3 years ago|reply
given the focus on transformers in recent years, modernizing and adapting CNNs to hardware advances seems like an under-researched area. it's interesting and something we're exploring.

FAIR did great work with this paper.

[+] godelski|3 years ago|reply
It is also important to remember that different networks have different inductive biases. Sutton's "Bitter Lesson"[0] argues for scale plus general methods that can capture arbitrary complexity. ViTs scale really well in part because of this. But I agree with your point that what to use is not obvious, and this gets really complicated really quickly.

For example, Swin loses some of the properties of self-attention, but Neighborhood Attention doesn't[1] (both are restricted forms of attention). Does this have a large effect? It might depend on your task.
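For contrast with Swin's fixed window grid, here is roughly what Neighborhood Attention computes: each query attends to its own k x k spatial neighborhood (clamped at the image border), so the locality window follows every token, like a convolution's receptive field. Again a hedged NumPy sketch with identity Q/K/V projections and no heads; the function name is mine, not from the paper's NATTEN implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def neighborhood_attention(x, k=3):
    """Each token attends only to its k x k neighborhood.

    x: (H, W, C) feature map. The neighborhood is clamped at the
    borders, so corner tokens see a smaller patch. Unlike window
    attention there is no fixed partition: the window is centered
    on every query, so locality is translation-consistent.
    """
    H, W, C = x.shape
    out = np.empty_like(x)
    r = k // 2
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - r), min(H, i + r + 1)
            j0, j1 = max(0, j - r), min(W, j + r + 1)
            nb = x[i0:i1, j0:j1].reshape(-1, C)  # neighborhood keys/values
            q = x[i, j]                          # query token
            attn = softmax(nb @ q / np.sqrt(C))
            out[i, j] = attn @ nb
    return out
```

A token only ever influences outputs within radius k//2 of itself, whereas in window attention its influence depends on which partition cell it happens to fall in.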

Looking at non-classification tasks is important. This is especially true since ImageNet has a lot of issues with redundancy (e.g. there is both a label "sunglass" (836) and "sunglasses" (837)), images for which multiple labels are valid, and more. I'd argue that once ImageNet accuracy is over 80%, accuracy no longer strongly correlates with downstream tasks[2] (segmentation, detection) or even other tasks like generation. This is why researchers really need to pay closer attention now; we can't just look at benchmarks, and doing so will hinder research. We could previously get away with it because classification strongly correlated with downstream performance and the error rate in ImageNet was much larger than the improvements in accuracy.

Worse than that, some of the main benchmarks we use are highly affected by these biases. For example, convolutions learn texture, so using something like FID[3,4] can have plenty of issues that might not give an accurate picture of how good a network actually is at its task. This is even true for metrics not based on deep models[5], so you have to be REALLY careful about how you evaluate things.

TLDR: be careful with evaluating benchmarks and evaluate things holistically.

===== Minimal Bib =====

[0] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

[1] https://arxiv.org/abs/2204.07143

[2] If you're wondering how this can happen, it is memorization. E.g. if the target image contains rabbits, cars, and other things (real example) but the label is "car wheel" (479), then the network has to learn to ignore "car mirror" (475) and rabbits (330,331,332). But downstream tasks like detection and segmentation perform multiple classifications on a single image, and using an over-fit backbone can hinder performance.

[3] https://arxiv.org/abs/2203.06026

[4] It is worth noting that FID is calculated from InceptionV3 weights, which were trained only on ImageNet-1k (the full dataset has ~22k classes) and reached an accuracy <80% (top-5 < 95%). It is a bit weird that this is used for datasets like FFHQ, because there is no person label. Or even for most LSUN classes, because ImageNet likely has a texture bias itself (with animals and plants composing the majority of the dataset and texture being an important feature there). Evaluating models is fucking hard, and a lot of this isn't internalized by many, especially outside the field.

[5] https://arxiv.org/abs/1511.01844
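Since FID comes up in [3,4]: it is just the Fréchet (2-Wasserstein) distance between two Gaussians fit to InceptionV3 feature statistics, which is exactly where the InceptionV3 biases above leak in. A minimal NumPy sketch of the distance itself, assuming the feature means/covariances are already computed (function names are mine):

```python
import numpy as np

def _psd_sqrt(a):
    # Symmetric PSD matrix square root via eigendecomposition.
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)  # clamp tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(mu1, cov1, mu2, cov2):
    """FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)).

    Tr((S1 S2)^(1/2)) is computed as Tr((S1^(1/2) S2 S1^(1/2))^(1/2)),
    which is symmetric PSD and numerically better behaved than taking
    the square root of the non-symmetric product S1 S2 directly.
    """
    s1h = _psd_sqrt(cov1)
    covmean_tr = np.trace(_psd_sqrt(s1h @ cov2 @ s1h))
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * covmean_tr)
```

With equal covariances the trace terms cancel and the distance reduces to the squared mean difference; every texture or class bias in the Inception features shows up in those means/covariances before this formula ever runs.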

[+] heyitsguay|3 years ago|reply
Something I've noticed is that there's a ton of work going into better vision transformers, but whenever there's a big new multimodal result or something, everyone's just using ViT. What gives? In CLIP or PaLM-E, for example, they could easily have used another vision architecture; it's not like they were running all modalities through the same encoder.

[+] logophobia|3 years ago|reply
Reminds me of the ideas behind google's multi-axis transformer: https://arxiv.org/abs/2204.01697

Both use a hierarchical transformer, adapting the transformer architecture to vision tasks more efficiently.