top | item 43417598

deet | 11 months ago

Vision LLMs are definitely an interesting application.

At Avy.ai we're running small (2B-7B, quantized) vision models as part of a Mac desktop application for understanding what someone is working on in the moment, to offer them related information and actions.

We found that with a light LoRA fine-tune, the raw image-understanding results are not substantially different. But fine-tuning greatly improves how well a small model follows instructions: outputting structured data in response to the image, at the level of verbosity and detail we need. Without fine-tuning, the models on the smaller end of that scale would be much more difficult to use, not reliably producing output that matches what the consuming application expects.
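A minimal sketch of the consuming-application side of this: validating the model's output against the structure the app expects, so that a non-conforming response can be rejected (and, say, retried). The field names and schema here are hypothetical, not Avy.ai's actual format.

```python
import json

# Hypothetical schema: fields the consuming application expects from the
# vision model's description of what the user is working on.
REQUIRED_FIELDS = {"app": str, "task_summary": str, "entities": list}

def parse_model_output(raw: str):
    """Return the parsed dict if the model's output matches the expected
    structure, else None (so the caller can retry or fall back)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), ftype):
            return None
    return data

# A fine-tuned model reliably emits the first shape; a base model often
# drifts into free-form prose like the second.
good = '{"app": "Xcode", "task_summary": "editing a Swift file", "entities": ["main.swift"]}'
bad = "Sure! The user appears to be editing a Swift file in Xcode."

assert parse_model_output(good) is not None
assert parse_model_output(bad) is None
```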

msp26 | 11 months ago

Was constrained decoding not enough to force the output to be in a specific format?

deet | 11 months ago

Using a grammar to constrain decoding to, say, valid JSON would work, but that hasn't always been available in the implementations we've been using (like MLX). It's solvable with software engineering, by adding grammar support to the decoders in those frameworks, but fine-tuning has been effective without that work.

The bigger thing, though, was getting the models to produce the appropriate level of verbosity and detail in their output, which fine-tuning made more consistent.
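For readers unfamiliar with grammar-constrained decoding: the idea is to mask, at each step, any token that would take the output outside the grammar, so the model can only emit valid continuations. A toy character-level sketch for a fixed JSON shape `{"label": "<text>"}` (real implementations like llama.cpp's GBNF grammars mask token logits instead; the model here is a stub scoring function):

```python
import json
import string

# Characters permitted inside the free-text value (no raw backslash or
# control characters, which would need JSON escaping).
TEXT_CHARS = set(string.ascii_letters + string.digits + string.punctuation + " ") - {"\\"}

def allowed_chars(prefix: str) -> set:
    """Characters that keep `prefix` a valid prefix of {"label": "<text>"}."""
    template = '{"label": "'
    if len(prefix) < len(template):
        return {template[len(prefix)]}  # forced literal prefix
    body = prefix[len(template):]
    if body.endswith('"}'):
        return set()                    # object complete: stop
    if body.endswith('"'):
        return {"}"}                    # closing quote seen: must close object
    return TEXT_CHARS                   # free text; '"' ends the string

def constrained_decode(score_fn, max_len=40) -> str:
    """Greedy decoding, restricted at every step to grammar-legal characters."""
    out = ""
    while len(out) < max_len:
        allowed = allowed_chars(out)
        if not allowed:
            break
        out += max(allowed, key=lambda c: score_fn(out, c))
    return out

def toy_scores(prefix: str, c: str) -> float:
    """Stub 'model' that wants to emit one particular completion."""
    target = '{"label": "editing code"}'
    if len(prefix) < len(target) and target[len(prefix)] == c:
        return 1.0
    return 0.0

result = constrained_decode(toy_scores)
assert json.loads(result) == {"label": "editing code"}
```

The masking guarantees well-formed output even from a model that would otherwise drift into prose, which is why it is a plausible alternative to fine-tuning for the formatting problem (though not for verbosity or level of detail).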