jabron | 3 months ago

What do you mean by "bounding boxes"? They were talking about captions and embeddings, so a vision-language model is required.

Glemkloksdjf | 3 months ago

I suggested YOLO and other non-LLM vision models as a much faster alternative.

Of course, CLIP would be the other option besides a big vision LLM.