top | item 39206375

LLaVA-1.6: Improved reasoning, OCR, and world knowledge

215 points | tosh | 2 years ago | llava-vl.github.io

43 comments

[+] devinprater|2 years ago|reply
Wow, this is pretty good. My sort of benchmark is a photo of someone holding my sweet little Ollie. Well, he's not so little and he's not mine anymore, but he'll always be my widdle Ollie!

Anyways, GPT-4 Vision wasn't always able to tell me that he doesn't really look the most comfortable being held, cause that's a lot of gravity pulling him down. Neither was LLaVA in the past. But with LLaVA 1.6 34B, it can, with no further questions asked besides "Please describe this image" as the first user message along with the image. So yeah, this is really amazing! Its OCR has also definitely improved. Before, it'd just say the text is in another language, but now it just shows the text. Can't wait to have a good enough computer to quickly run this locally.

[+] kromem|2 years ago|reply
It must be a pretty exciting time for you.

I can't imagine how much of a difference it would make when vision models like this can be run at the edge in real time for blind users, including proper triage of relevant scene information narration.

Wild to think about the long tail of accessibility and how that's going to be wildly different with the improvements to generative AI.

[+] fngjdflmdflg|2 years ago|reply
To me this is the money shot:

>LLaVA-1.6 is trained with 32 GPUs for ~1 day, with 1.3M data samples in total. The compute / training data cost is 100-1000 times smaller than others.

[+] pushfoo|2 years ago|reply
Agreed. If the resource usage can be optimized further, it'll be feasible to train specialist models both on-prem and from scratch. That would sidestep the liability and privacy issues of current cloud-based offerings.

Google's search appliances did the same thing before they were retired. Hospitals were especially keen on them. They eliminated HIPAA risks because data never left the hospital intranet, but then Google eliminated the product line in favor of cloud offerings.

[+] mountainriver|2 years ago|reply
Yes, but to be clear, that's just the vision layers
[+] mildbyte|2 years ago|reply
Damn, literally a day after I wrote up my experiments[0] with LLaVA 1.5 and computing image embeddings. Interesting to see the performance with the fine-tuned Mistral-7B variant being pretty close to the one with Vicuna-13B - using Mistral 7B is what BakLLaVA did back with LLaVA 1.5.

[0] https://mildbyte.xyz/blog/llama-cpp-python-llava-gpu-embeddi...

[+] sho_hn|2 years ago|reply
Anyone got any fun stories to share of trying to use LLaVA e.g. to make toy robots navigate? How good is it at outputting directions in structured data, guess distances, angles, etc.?

My weekend hacking goal would be something like an RC car that can "drive to the largest plant in the room" or "go hide under the dining table" when prompted by voice. Slowly, by combining some sort of basic SLAM with still-image prompting.

This looks quite promising: https://i.imgur.com/DnYWYPl.jpeg

Of course an alternative to doing it one-shot would be to collect lots of pictures + orientation for each, have LLaVA only caption them, then prompt a more generic LLM with that collected world info to pick where to go, etc.
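The caption-then-plan variant is easy to sketch. In the snippet below the actual model calls are left as comments with hypothetical helper names (`llava_caption`, `llm_complete` stand in for whatever local inference stack you use), and the captions are made up for the demo; only the prompt assembly and answer parsing are shown concretely.

```python
import re

def build_plan_prompt(goal, captions):
    """Turn (heading_degrees, caption) pairs into a single planning prompt."""
    lines = [f"- heading {heading:3d} deg: {caption}" for heading, caption in captions]
    return (
        "You control an RC car. Below are captions of what the camera sees\n"
        "at several headings. Reply with exactly one heading number.\n"
        f"Goal: {goal}\n\n" + "\n".join(lines)
    )

def parse_heading(reply, valid_headings):
    """Extract the first valid heading mentioned in the LLM's reply."""
    for token in re.findall(r"\d+", reply):
        if int(token) in valid_headings:
            return int(token)
    return None  # nothing usable: caller should re-prompt

# captions would come from llava_caption(frame) per heading; hardcoded here
captions = [(0, "a sofa against a wall"), (90, "a large monstera in a pot"),
            (180, "an open doorway"), (270, "a dining table with chairs")]
prompt = build_plan_prompt("drive to the largest plant in the room", captions)
# reply = llm_complete(prompt)  # hypothetical; a canned reply is used below
heading = parse_heading("Head toward heading 90, the plant.", {h for h, _ in captions})
```

Constraining the model to answer with one of a few known numbers, then validating the parse, tends to be more robust with small local models than asking for free-form structured output.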

What I like most about this AI stuff is how many neat things it makes achievable in a weekend by a motivated hobbyist that in the past required entire companies to tackle :). DIY/maker life in the AI age has been amazing fun so far.

[+] sho_hn|2 years ago|reply
Apologies, this was the wrong image upload, and it's now too late to edit the post. The intended one was a screenshot of a LLaVA 1.6 demo conversation about it:

https://i.imgur.com/J4yZ8xH.png

[+] gitfan86|2 years ago|reply
My best guess is that you want a supervisor GPT-4 like LLM planning the task and a lower level on-prem LLM doing the tasks like driving from one location to another or grasping an item.

Sending every frame to GPT-4 right now is way too slow. But a Tesla FSD-like model can drive from one location to another in a closed environment with perfection. All that is missing is training that style of model in a roomba/robot form and then having GPT-4 monitor and manage the tasks at a 10 or 20 second interval
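That supervisor/worker split can be sketched as a simple tick loop. Everything model-related below is a hypothetical placeholder (the commented-out `local_policy_step` / `cloud_plan` calls); what the sketch actually demonstrates is the scheduling: the local policy runs every tick, while the remote planner is consulted only every N ticks.

```python
def supervise(goal, total_ticks=60, review_every=10):
    """One tick = one local control step (e.g. one camera frame).
    Returns how many times the remote planner was consulted."""
    subtask = "explore"  # initial subtask until the planner weighs in
    reviews = 0
    for tick in range(1, total_ticks + 1):
        # local_policy_step(subtask)                    # fast, on-device, every tick
        if tick % review_every == 0:
            # subtask = cloud_plan(goal, latest_frame())  # slow, remote, infrequent
            reviews += 1
    return reviews
```

With one tick per second and `review_every=10`, a 60-second run makes only 6 cloud calls, which is roughly the 10-20 second review cadence described above.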

[+] sestinj|2 years ago|reply
My current test of image models is generating React code from a screenshot of the top of a HackerNews comment page. Llava-1.6 gave me this (over two responses), which is honestly not bad:

```ts
const CommentForm = () => {
  // State to hold the user's input
  const [comment, setComment] = useState('');
  // State to hold the list of comments
  const [comments, setComments] = useState([]);

  // Function for posting a new comment
  const postComment = (e) => {
    e.preventDefault();
    // Add logic here to handle the POST request and update state
    setComments([...comments, { content: comment }]); // Assuming you want the entire object in your state
    setComment(''); // Reset the input field after posting
  };
...
```

```ts
import React from 'react';
import { CommentForm } from './CommentForm';

const App = () => {
  return (
    <div>
      <h1>Comments</h1>
      <CommentForm />
      {comments && (
        <ul>
          {comments.map(comment => <li>{comment.content}</li>)}
        </ul>
      )}
    </div>
  );
};

export default App;
```

[+] thelastparadise|2 years ago|reply
Perhaps try two stages.

Stage 1: Generate detailed description from image w/llava.

Stage 2: Code the page using miqu 1.
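A minimal sketch of that two-stage glue, with the model calls injected as callables. `vlm_describe` and `llm_code` are hypothetical stand-ins for a LLaVA endpoint and a text-only model such as miqu, and the demo strings are made up; only the plumbing is real.

```python
def two_stage(image_path, vlm_describe, llm_code):
    """Stage 1: VLM -> detailed description. Stage 2: text LLM -> code."""
    description = vlm_describe(image_path, prompt=(
        "Describe this web page screenshot in detail: layout, colors, "
        "fonts, and every visible text string."))
    return llm_code("Write a React component reproducing this page:\n\n"
                    + description)

# Demo with canned stand-ins instead of real model calls:
out = two_stage(
    "hn_screenshot.png",
    lambda path, prompt: "An orange header bar with the text 'Hacker News'.",
    lambda prompt: "// generated for:\n" + prompt,
)
```

An upside of the split is that each stage can be swapped or cached independently: the same description can be re-fed to several code models.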

[+] benopal64|2 years ago|reply
Wow! You folks are making huge strides for open-source multimodal models. Thank you for all the time and effort on these as they will open up many opportunities for researchers and developers. Also, the emergent zero-shot capabilities when LLaVA-1.6 is tested against Chinese benchmarks with only English multi-modal training data are interesting and that may be a good direction for future research.
[+] GaggiX|2 years ago|reply
Demo: https://llava.hliu.cc/

My main interest with VLMs is their ability to caption images, and this one honestly seems very good; it's going to be super useful for captioning datasets.

[+] m00x|2 years ago|reply
CogVLM is also really good at image captioning. I've been using it in the past month and it's shown very good results.

I'm excited to see how this works in practice though.

[+] ranguna|2 years ago|reply
This thought just occurred to me: would it make sense to train a model to recognise vector encoded video frames?

I've completely forgotten how video encoders work, but I do remember that some encode the differences between one frame and the next with vectors of the "motion" of pixels. The training data would be comprised of frames from a video with labels assigned to time ranges across the length of the video.

Then we could feed a video stream to a model and it would learn to not only recognise still images, but also motion across time.

Make it fast enough and we would have near real time inference of video.

If this works, maybe an extension to this model would be to accept its previous inference result as an input to the next frame's inference request. Then we'd have results like "a person entered the brightly lit scene of a sunny day in the countryside".
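For what it's worth, real codecs (H.264/HEVC) do compute per-macroblock motion vectors at encode time that can in principle be read back out at decode time. As a crude, dependency-free stand-in for that signal, even a plain frame-difference feature separates "still" from "moving" input; the toy frames below are made up, and anything beyond this sketch would need actual codec access or optical flow.

```python
def motion_energy(prev_frame, next_frame):
    """Mean absolute pixel change between two grayscale frames,
    given as equal-length flat lists of 0-255 intensities."""
    assert len(prev_frame) == len(next_frame)
    total = sum(abs(a - b) for a, b in zip(prev_frame, next_frame))
    return total / len(prev_frame)

still = [10] * 16                  # toy 4x4 frame, nothing moves
moved = [10] * 8 + [200] * 8       # half the pixels changed a lot
```

A model fed features like this (or real motion vectors) sees the temporal signal directly instead of having to re-derive it from pairs of raw frames.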

[+] mvelbaum|2 years ago|reply
Is there an OSS model that can do OCR on the level of Google Cloud Vision or Amazon Textract?
[+] andrewmutz|2 years ago|reply
Anyone know how much it costs to run this yourself vs paying money for GPT 4 vision?

The article says the training costs are far lower, but it doesn't say how the inference costs compare (unless I'm missing it)

[+] Departed7405|2 years ago|reply
I tested GPT-4V and LLaVA 1.6 on a Chinese text and they both hallucinate like crazy. VLMs can still barely recognize characters while traditional OCR nails it. Does anyone know why?
[+] 7734128|2 years ago|reply
Almost definitely just that it hasn't been trained on such data, rather than the task being inherently difficult. I tried Llava 1.6 on an image with Swedish text and it parsed a large and clear Ä as A while the other letters were mostly correct.
[+] GaggiX|2 years ago|reply
Well for LLaVa-1.6 we know it was trained only on English multi-modal data.
[+] bredren|2 years ago|reply
Can anyone comment on the practical use of a model like this versus traditional libraries for OCR?

I’m specifically interested in processing smartphone photos of pages from an out of print book.

[+] gryn|2 years ago|reply
one potential use case I've had in mind and never gotten around to building is using them to tag and sort a huge library of images in detail; OCR is useless here. the idea would be to have a more semantic type of search based on the content of the image or its art style.

so far in my tests gpt4-v seems to perform better, though it's very heavy on the censorship guardrails.

1.6 performs better than the previous version and seems to hallucinate a bit less.

[+] jebarker|2 years ago|reply
One use is that these models can do OCR in the wild, e.g. reading text from a sign on a window in a photo. I think traditional OCR libraries are more focused on reading printed pages.
[+] chx|2 years ago|reply
There's no reasoning involved with LLMs. Please. Words have meaning.
[+] ukuina|2 years ago|reply
What makes you certain there is no reasoning involved here? Is it lack of "intent"? Does the user's prompt not provide sufficient intent to the LLM?

Based on the demo linked in the article, you can specifically prompt "What is unusual about this image? Walk me through your reasoning step by step" and get a thorough understanding of the reasoning behind the LLM's response.

So, yes, words do have meaning, and the word "reasoning" appears apt.

[+] m00x|2 years ago|reply
Untrue, but also incorrect to call it an LLM. This is a VLM.
[+] exe34|2 years ago|reply
How can you tell? Or to avoid proving a negative, how would you measure if reasoning was involved?