coder543 | 2 years ago
I notice that the repo hasn’t been updated since April, and a question asking for an update has been ignored for at least a month: https://github.com/Stability-AI/StableLM/issues/83
courseofaction | 2 years ago
Am I misunderstanding this? Is this not a big deal?
emadm | 2 years ago
Versions being worked on now will do much better.
GPT-4 is far better and will likely not be beaten by any current open model or approach, though maybe by an ensemble of them.
SparkyMcUnicorn | 2 years ago
All I see is "compares favorably with GPT-3.5 for some tasks".
monlockandkey | 2 years ago
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
davidkunz | 2 years ago
Note: it's "Llama 2", not "LLaMA 2"; they changed the capitalization.
swfsql | 2 years ago
I'm out of context, but shouldn't it be possible to train an LLM-like model for images? (as an alternative to the stable diffusion process)
If you rearrange all pixels from square images using the Hilbert curve, you end up with the pixels arranged in 1D, and that shouldn't be much different from the "word tokens" LLMs are used to dealing with, right? Like an LLM that only "talks" in pixels.
This would have the benefit that you might be able to use various resolutions during training with the model still "converging" (since the Hilbert curve's ordering stabilizes as resolution grows).
I'm not sure whether the colors would also need to be linearized; if so, maybe it could work to represent the RGB values as a 3D cube and apply a 3D Hilbert curve to it, giving a 1D representation of all the colors.
I don't really know the subject, but I guess something like that should be possible.
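A minimal sketch of the rearrangement being described, assuming a power-of-two image side (`d2xy` here is the classic Hilbert-curve index-to-coordinate routine, written out by hand rather than taken from any particular library):

```python
import numpy as np

def d2xy(n, d):
    """Map index d along the Hilbert curve to (x, y) on an n x n grid (n a power of 2)."""
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:                      # rotate the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x                  # swap x and y
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

n = 8
img = np.arange(n * n * 3).reshape(n, n, 3)    # toy n x n RGB image
order = [d2xy(n, d) for d in range(n * n)]     # curve order over the grid
seq = np.array([img[y, x] for x, y in order])  # (n*n, 3) "pixel token" sequence
```

Each pair of consecutive tokens in `seq` comes from adjacent pixels, which is the spatial locality the idea relies on; a plain raster flatten does not have that property at row boundaries.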
singhrac | 2 years ago
No need for a Hilbert curve; you can just flatten pixels the usual way (i.e. X = img.reshape(-1)). The main issue is that attention doesn't scale that well, and with a 512x512 image the attended region is now 262k tokens, which is a lot. The other issue is that you'd throw away data by linearizing colors (why not keep them 3-dimensional?).
The corresponding work you're looking for is Vision Transformers (ViT) - they work well, though not as well as LLMs, I think, for generation. Also, I think people like that diffusion models are comparatively small, even if expensive to run - they'd rather wait than OOM.
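One concrete caveat for treating raw pixels as tokens (back-of-the-envelope numbers only, not any particular model's figures): a flattened 512x512 RGB image is already 262,144 tokens, and dense self-attention builds a score matrix quadratic in that length:

```python
import numpy as np

img = np.zeros((512, 512, 3), dtype=np.uint8)  # toy image
tokens = img.reshape(-1, 3)     # raster-order pixel tokens, RGB kept 3-dimensional
seq_len = tokens.shape[0]
print(seq_len)                  # 262144 tokens for a single image
print(seq_len ** 2)             # 68719476736 attention scores per head per layer
```

That 2^36-entry score matrix per head is why pixel-level sequence models typically need patching (as in ViT) or some sub-quadratic attention variant.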
T-A | 2 years ago
https://huggingface.co/stabilityai/FreeWilly2
nomel | 2 years ago
> Although the aforementioned dataset helps to steer the base language models into "safer" distributions of text, not all biases and toxicity can be mitigated through fine-tuning.