top | item 45981514

Meta Segment Anything Model 3

178 points | alcinos | 3 months ago | ai.meta.com

48 comments

vessenes|3 months ago

Released last week. Looks like all the weights are now out and published. Don’t sleep on the SAM 3D series; it’s seriously impressive. They have a human pose model which actually rigs and keeps multiple humans in a scene with objects, all from one 2D photo (!), and their straight object 3D model is by far the best I’ve played with: it produced a very good lamp, with translucency and woven gems, in usable shape in under 15 seconds.

Qwuke|3 months ago

Between this and DINOv3, Meta is doing a lot for the SOTA even if Llama 4 came up short compared to the Chinese models.

Fraterkes|3 months ago

Are those the actual wireframes they're showing in the demos on that page? As in, do the produced models have "normal" topology? Or are they still just kind of blobby with a ton of polygons?

retinaros|3 months ago

I looked quickly, but it does not generate a 3D model file, right?

enoch2090|3 months ago

Surprisingly, SAM3 works badly on engineering drawings while SAM2 kind of works, and VLMs like Qwen3-VL work as well.

zubiaur|3 months ago

Had good luck with Gemini 2.5; SAM3 failed miserably with PIDs.

retinaros|3 months ago

Yeah, I tried too. I'm trying to fine-tune on PIDs.

the_duke|3 months ago

Side question: what are the current top go-to open models for image captioning and building image embedding DBs, with somewhat reasonable hardware requirements?

daemonologist|3 months ago

For pure image embedding, I find DINOv3 to be quite good. For multimodal embedding, maybe RzenEmbed. For captioning I would use a regular multimodal LLM, Qwen 3 or Gemma 3 or something, if your compute budget allows.
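
As a rough illustration of the "image embeddings DB" part of the question, here is a minimal sketch of nearest-neighbour lookup by cosine similarity. The vectors are tiny hand-written placeholders, not real model output, and `nearest_images` is a hypothetical helper name; in practice the vectors would come from whichever embedding model you pick and would live in a proper vector store.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_images(query, index, top_k=3):
    """Return the top_k (name, score) pairs from index, best match first.

    index maps an image name to its embedding vector (placeholder data here).
    """
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Placeholder embeddings standing in for real model output.
index = {
    "cat.jpg": [1.0, 0.0, 0.0],
    "dog.jpg": [0.0, 1.0, 0.0],
    "kitten.jpg": [0.9, 0.1, 0.0],
}
results = nearest_images([1.0, 0.0, 0.0], index, top_k=2)
```

A linear scan like this is fine for a few thousand images; beyond that an approximate nearest-neighbour index is the usual choice.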

NitpickLawyer|3 months ago

Try any of the qwen3-vl models. They have 8, 4 and 2B models in this family.

Glemkloksdjf|3 months ago

I would suggest YOLO. Depending on your domain, you might also fine-tune these models. It's relatively easy, as they are not big LLMs but either image classification or bounding box models.

I would recommend bounding boxes.

phkahler|3 months ago

Which (if any) of these models could run on a Raspberry Pi for object recognition at several FPS?

aliljet|3 months ago

I wonder how effective this is in medical scenarios, e.g. segmenting organs and tumors in CAT scans or MRIs?

colkassad|3 months ago

Been waiting days to get approval to download this from Hugging Face. What's up with that?

observationist|3 months ago

Alternative downloads exist. You can find torrents, and match checksums against the HF downloads, but there are also mirrors and clones right there in HF which you can download without even having to log in.
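
A minimal sketch of the checksum-matching step, assuming the published checksum is a SHA-256 hex digest (that format is an assumption here, so check what the source actually lists). The function names are illustrative, not part of any Hugging Face tooling.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected_hex):
    """True if the local file's digest equals the published one (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte weight files.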

knicholes|3 months ago

I was approved within about 10 minutes for "Segment Anything 3"

shashanoid|3 months ago

I miss the old Segment Anything page; I used it a lot. I found this UI very complex to use.

vanjoe|3 months ago

For a long time I've wanted to use something like this to remove advertisements from hockey games. The moving ads on the boards are really annoying. Maybe I'll get around to actually doing that one of these days.

cheesecompiler|3 months ago

This would be convenient for post-production and editing of video, e.g. to aid colour grading in DaVinci Resolve. Currently a lot of manual labour goes into tracking and hand-masking in grading.

maelito|3 months ago

I wonder if this can be used to track an object's speed. E.g. a vehicle on a road. It would need to recognize shapes, e.g. car model or average size of a bike, to guess a speed.
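
The arithmetic behind that idea can be sketched as follows, assuming a tracker already gives a bounding box for the vehicle in two frames, that the vehicle moves roughly parallel to the image plane, and that its real-world length is known or guessed. The function name and all numbers are hypothetical, and perspective and lens distortion are ignored.

```python
import math


def estimate_speed_kmh(box_start, box_end, frames_elapsed, fps, real_length_m):
    """Estimate speed from two boxes given as (x_min, y_min, x_max, y_max) in pixels.

    The object's on-screen width is assumed to correspond to real_length_m,
    which gives a metres-per-pixel scale for the displacement of the box centre.
    """
    # Average the box width over both frames to smooth out tracker jitter.
    width_px = ((box_start[2] - box_start[0]) + (box_end[2] - box_end[0])) / 2
    metres_per_pixel = real_length_m / width_px

    cx_start = (box_start[0] + box_start[2]) / 2
    cy_start = (box_start[1] + box_start[3]) / 2
    cx_end = (box_end[0] + box_end[2]) / 2
    cy_end = (box_end[1] + box_end[3]) / 2
    displacement_px = math.hypot(cx_end - cx_start, cy_end - cy_start)

    seconds = frames_elapsed / fps
    metres_per_second = displacement_px * metres_per_pixel / seconds
    return metres_per_second * 3.6  # convert m/s to km/h


# Hypothetical example: a 4.5 m car moves 300 px in 15 frames of 30 fps video.
speed = estimate_speed_kmh((100, 200, 300, 280), (400, 200, 600, 280), 15, 30, 4.5)
```

The estimate is only as good as the assumed real-world length, which is why recognizing the car model or vehicle class would matter.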

Workaccount2|3 months ago

I do a test on multimodal LLMs where I show them a dog with 5 legs, and ask them to count how many legs the dog has. So far none of them can do it. They all say "4 legs".

Segment Anything, however, was able to segment all 5 dog legs when prompted to. That suggests Meta is doing something else under the hood here, which may lend itself to a very powerful future LLM.

Right now some of the biggest complaints people have with LLMs stem from their incompetence at processing visual data. Maybe Meta is onto something here.

jampekka|3 months ago

Segmentation doesn't need to count legs. I'd guess something like YOLO could segment 5-legged dogs too.

nerdsniper|3 months ago

You don’t need segmentation to count legs. Object detection can do that. DeepLabCut from 2020 perhaps.
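
To make the distinction in these replies concrete: once something has produced a binary mask, counting is a separate and purely mechanical post-processing step. A minimal sketch, assuming the mask is a 2D list of 0/1 values and that each leg shows up as its own 4-connected region (both assumptions; real masks can merge or split):

```python
from collections import deque


def count_regions(mask):
    """Count 4-connected regions of truthy cells in a 2D list via breadth-first flood fill."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Found an unvisited foreground cell: it starts a new region.
            regions += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < rows and 0 <= nx < cols and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return regions


# Toy mask with five separate vertical bars standing in for leg masks.
toy_mask = [
    [1, 0, 1, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 0, 1, 0, 1],
]
leg_count = count_regions(toy_mask)
```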

PunchTornado|3 months ago

I doubt that Gemini 3 cannot do it.