The SAM paper from this past April (which lets you do zero-shot segmentation on any image, seemingly better than even OpenAI's CLIP) used a ~600M-parameter ViT model to generate image embeddings.
And to make generating those same embeddings less computationally expensive, they replaced that model with a smaller ViT encoder that was pre-trained with the masked autoencoder (MAE) method?
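For anyone unfamiliar with MAE pre-training: the key trick is that the encoder only ever sees a small fraction of the image patches, which is what makes it cheap. Here's a toy numpy sketch of just the masking setup (the patch count, dimensions, and mean-patch "decoder" are made up for illustration; a real MAE uses transformer blocks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image" split into 16 patches of 8 features each
# (stand-ins for ViT patch embeddings).
num_patches, dim = 16, 8
patches = rng.normal(size=(num_patches, dim))

# MAE-style masking: hide a large fraction of the patches
# (75% is the ratio the MAE paper reports working well).
mask_ratio = 0.75
num_masked = int(num_patches * mask_ratio)
perm = rng.permutation(num_patches)
masked_idx, visible_idx = perm[:num_masked], perm[num_masked:]

# The encoder processes only the visible patches -- with a 75% mask
# it runs on a quarter of the tokens, which is where the savings come from.
visible = patches[visible_idx]

# Trivial stand-in "reconstruction": predict the mean visible patch
# at every masked position (a real MAE decoder is a small transformer).
prediction = np.tile(visible.mean(axis=0), (num_masked, 1))

# The reconstruction loss is computed only on the masked patches.
loss = float(np.mean((prediction - patches[masked_idx]) ** 2))
print(visible.shape, num_masked)
```

The point is just the bookkeeping: mask most patches, encode the rest, reconstruct and score only what was hidden.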
The comparison is in figure 1 of the paper. I think the bubble size represents the number of parameters, which probably corresponds roughly to memory consumption.