item 46971287

Show HN: I taught GPT-OSS-120B to see using Google Lens and OpenCV

43 points | vkaufmann | 19 days ago

I built an MCP server that gives any local LLM real Google search and now vision capabilities - no API keys needed.

The latest feature: google_lens_detect uses OpenCV to find objects in an image, crops each one, and sends them to Google Lens for identification. GPT-OSS-120B, a text-only model with zero vision support, correctly identified an NVIDIA DGX Spark and a SanDisk USB drive from a desk photo.

Also includes Google Search, News, Shopping, Scholar, Maps, Finance, Weather, Flights, Hotels, Translate, Images, Trends, and more. 17 tools total.

Two commands: pip install noapi-google-search-mcp && playwright install chromium

GitHub: https://github.com/VincentKaufmann/noapi-google-search-mcp
PyPI: https://pypi.org/project/noapi-google-search-mcp/

Booyah!

31 comments


l1am0|19 days ago

I don't get this. Isn't this the same as saying "I taught my 5-year-old to calculate integrals by typing them into Wolfram Alpha"? So the actual relevant cognitive task (integrals in my example, "seeing" in yours) is outsourced to an external API.

Why do I need gpt-oss-120B at all in this scenario? Couldn't I just directly call e.g. the gemini-3-pro API from the Python script?

reedf1|19 days ago

'Calculating' an integral is usually done by applying a series of somewhat abstract mathematical tricks. There is usually no deeper meaning applied to the solving. If you have profound intuition you can guess the solution to an integral 'by inspection'.

What part here is the knowing or understanding? Does solving an integral symbolically provide more knowledge than numerically or otherwise?

As for understanding the underlying functions themselves and the areas they sweep: has substitution or integration by parts actually provided you with that?
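The symbolic/numeric distinction the comment gestures at can be made concrete: the closed form of an integral gives an exact value, while a quadrature rule only ever approximates it. A minimal stdlib-only sketch:

```python
# By the power rule, the integral of x^2 from 0 to 1 is exactly 1/3.
# A numerical method like the midpoint rule only approximates that value.
def midpoint_integral(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

approx = midpoint_integral(lambda x: x * x, 0.0, 1.0)
exact = 1.0 / 3.0
print(approx, abs(approx - exact))  # tiny but nonzero error
```

Whether the symbolic route confers more "understanding" than this loop is exactly the philosophical question being debated.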

villgax|19 days ago

Booyah yourself, this is like being able to call two APIs and calling it learning? I thought you did some VLM stuff with a projection.

leumon|19 days ago

Next, try actually teaching it to see by training a projector with a vision encoder on gpt-oss.

vkaufmann|19 days ago

About to release GPT-OSS-120B-Vision and GPT-OSS-20B-Vision, how about that! :D

vkaufmann|19 days ago

too slow bro

magic_hamster|19 days ago

> GPT-OSS-120B, a text-only model with zero vision support, correctly identified an NVIDIA DGX Spark and a SanDisk USB drive from a desk photo.

But wasn't it Google Lens that actually identified them?

vessenes|19 days ago

Confused as to why you wouldn't integrate a local VLM if you want a local LLM as the backbone. Plenty of 8B-30B VLMs out there that are visually competent.

vkaufmann|19 days ago

It's meant to be super lightweight for people who run 1B, 3B, 8B or 20B models on skinny devices: one pip install with high impact :D

N_Lens|19 days ago

Looks like a TOS violation to me to scrape Google directly like that. While the concept of giving a text-only model 'pseudo vision' is clever, I think the solution in its current form is a bit fragile. The SerpAPI, Google Custom Search API, etc. exist for a reason; for anything beyond personal tinkering, this is a liability.

embedding-shape|19 days ago

> Looks like a TOS violation to me to scrape google directly like that

If something was built by violating TOS, and you use that thing to do more TOS violations against the ones who initially violated TOS to build it, do they cancel each other out?

Not about GPT-OSS specifically, but say you used Gemma for the same purpose instead for this hypothetical.

vkaufmann|19 days ago

Coolest thing about it is, it's one pip install to give your local model the ability to see, do Google searches, and use News, Shopping, Scholar, Maps, Finance, Weather, Flights, Hotels, Translate, Images, Trends, etc.

Easiest and fastest way, and the impact is massive.

speedgoose|19 days ago

Isn’t SerpAPI about scraping Google through residential proxies, as a service?

vkaufmann|19 days ago

Thought this is "hacker news" bro

TZubiri|19 days ago

Have you tried Llama? In my experience it has been strictly better than GPT-OSS, but it might depend on specifically how it is used.

embedding-shape|19 days ago

Have you tried GPT-OSS-120B MXFP4 with reasoning effort set to high? Out of all models I can run within 96GB, it seems to consistently give better results. What exact Llama model (+ quant, I suppose) is it that you've had better results with, and what did you compare it against, the 120B or 20B variant?

tanduv|19 days ago

You eventually get hit with a captcha with the Playwright approach.
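A low-effort mitigation is at least detecting when Google has redirected the scraper to its rate-limit interstitial, which Google serves from a /sorry/ path. The helper below is a hypothetical sketch of such a check on the final page URL, not part of the package:

```python
# Hypothetical helper: detect Google's rate-limit/captcha interstitial
# by its URL. Google serves these pages from google.com/sorry/...
from urllib.parse import urlparse

def looks_like_captcha(url: str) -> bool:
    """Heuristic check for Google's captcha interstitial URL."""
    parsed = urlparse(url)
    return parsed.netloc.endswith("google.com") and parsed.path.startswith("/sorry")

print(looks_like_captcha("https://www.google.com/sorry/index?continue=x"))  # True
print(looks_like_captcha("https://www.google.com/search?q=test"))           # False
```

A scraper could check this after navigation and back off (or surface an error to the LLM) instead of silently returning the interstitial's HTML.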