Can you elaborate?
In my experience it is the opposite: CNNs depend heavily on the input tensor shape, so a change in resolution can even require an architectural change.
In a ViT, a resolution change only leads to more tokens, and the model can handle that (for image classification, e.g., you always take the CLS token; segmentation maps and similar tasks produce an output of the same size as the input).
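To make the shape argument concrete, here is a rough sketch in plain Python. The function names, the patch size of 16, and the CNN numbers (512 channels, total stride 32, a flatten followed by a fully connected head) are my own assumptions for illustration, not taken from any specific model. A CNN that ends in global pooling would not have this problem.

```python
def vit_num_tokens(height, width, patch_size=16, cls_token=True):
    """Sequence length a ViT sees for a given input resolution.

    Assumes non-overlapping square patches (hypothetical default: 16).
    """
    if height % patch_size or width % patch_size:
        raise ValueError("resolution must be divisible by patch_size")
    num_patches = (height // patch_size) * (width // patch_size)
    # The CLS token adds one entry regardless of resolution.
    return num_patches + (1 if cls_token else 0)


def cnn_flatten_features(height, width, channels=512, total_stride=32):
    """Input size of the first fully connected layer in a CNN that
    flattens its last feature map (assumed channels and stride)."""
    return channels * (height // total_stride) * (width // total_stride)


for res in (224, 384):
    print(res, vit_num_tokens(res, res), cnn_flatten_features(res, res))
```

Going from 224 to 384, the ViT just gets a longer sequence (197 to 577 tokens) and the classification head still reads the single CLS token. The flattened CNN feature vector grows from 25088 to 73728 values, so the weight matrix of the fully connected layer no longer fits.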