I’m the Co-founder and CTO of Krea. We’re excited because we’ve wanted to release the weights for our model and share them with the HN community for a long time.
My team and I will try to be online throughout the day to answer any questions you may have.
Any plans to work with the Flux 'Kontext' version, the editing models? I think the use cases for this kind of prompted image editing are just wildly huge. Their demo blew my mind, although I haven't seen the quality of the open-weight version yet. It is also a 12B distill.
Regarding the P(.|photo) vs P(.|minimal) example, how do you actually resolve this conflict? It seems to me that photorealism should be a strong default "bias".
My reasoning: If the user types in "a cat reading a book" then it seems obvious that the result should look like a real cat which is actually reading a book. So it obviously shouldn't have an "AI style", but it also shouldn't produce something that looks like an illustration or painting or otherwise unrealistic. Without further context, a "cat" is a photorealistic cat, not an illustration or painting or cartoon of a cat.
In short, it seems that users who want something other than realism should be expected to mention it in the prompt. Or am I missing some other nuances here?
Hi! I'm the lead researcher on Krea 1. FLUX.1 Krea is a 12B rectified flow model distilled from Krea 1, designed to be compatible with the FLUX architecture. Happy to answer any technical questions :)
Regarding this part: > Since flux-dev-raw is a guidance distilled model, we devise a custom loss to finetune the model directly on a classifier-free guided distribution.
Could you go into more detail on the specific loss used for this, and share any other tips you might have for finetuning it? I remember the general open-source AI art community had a hard time finetuning the original distilled flux-dev, so I'm very curious about that.
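For context, here's my rough mental model of why this is tricky (just a sketch with made-up function names, not the actual custom loss from the blog post): a guidance-distilled student is trained to reproduce the teacher's CFG-combined prediction in a single pass, so naively finetuning it with a plain flow-matching loss on raw data pulls it back toward the unguided distribution.

```python
# Hypothetical sketch of a generic guidance-distillation objective -- NOT the
# Krea team's custom loss, just the usual shape of the technique.
import torch
import torch.nn.functional as F

def cfg_distillation_loss(student, teacher, x_t, t, text_emb, null_emb, guidance_scale):
    # The frozen teacher is run twice: once conditioned on the prompt, once on
    # the empty/null prompt, and the two velocity predictions are CFG-combined.
    with torch.no_grad():
        v_cond = teacher(x_t, t, text_emb)
        v_uncond = teacher(x_t, t, null_emb)
        v_cfg = v_uncond + guidance_scale * (v_cond - v_uncond)
    # The guidance-distilled student takes the scale as an input and is trained
    # to match the combined prediction in a single forward pass.
    v_pred = student(x_t, t, text_emb, guidance_scale)
    return F.mse_loss(v_pred, v_cfg)
```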
I come from a traditional media production background, where media is produced in separate layers that are then composited together to create a final deliverable still image, motion clip, and/or audio clip. This kind of production, building elements that are then combined, is essential for expense management and quality control. Current AI image, video, and audio generation methods do not support any of that. ForgeUI did briefly, but that went away, which I suspect is because few understand large-scale media production requirements.
I guess my point is: do you have any (real) experienced media production people working with you? People with experience in actual feature film VFX, animated commercials, and multi-million dollar budget productions?
If you really want to make your efforts a wild success, simply support traditional media production. None of the other AI image/video/audio providers seem to understand this, and it is gargantuan: if your tools plugged into traditional media production, they would be adopted immediately. Currently, they are adopted only tentatively because they do not integrate with production tools or expectations at all.
I recently ran a training experiment using the same dataset, number of steps, and epochs on both Flux Dev and Flux Krea models.
What stood out to me was that Flux Dev followed the text prompts more accurately, whereas Krea’s generations were more loosely aligned or "off" in terms of prompt fidelity, with deformations in body type and architecture.
Does this suggest that Flux Krea requires more training to achieve strong text-to-image alignment compared to Flux Dev? Or is it possible that Krea is optimized differently (e.g. for style, detail, or artistic variation rather than strict prompt adherence)?
Curious if anyone else has experienced this or has any insight into the differences between these two. Would love to hear your thoughts.
Nice release. Ran some preliminary tests using the 12B Txt2Img Krea model. Its biggest wins seem to be raw speed (and possibly realism), but perhaps unsurprisingly it did not score any higher on the leaderboard for prompt adherence than the normal Flux.1D model.
On another note, there seems to be some indication that Wan 2.2+ future models might end up becoming significant players in the T2I space, though you'll probably need a metric ton of LoRAs to cover some of the lack of image diversity.
Can you point to a URL with the tests you’ve done?
Also, FWIW, this model's focus was on aesthetics rather than strict prompt adherence. Not to excuse the bad samples, but to emphasize one of the research goals.
It’s a thorny trade-off, but an important one if one wants to get rid of what’s sometimes known as “the flux look”.
Re: Wan 2.2, I’ve also seen people commenting about using Wan 2.2 for the base generation and Krea for the refiner pass, which I thought was interesting.
Can someone ELI5 why the safetensors file is 23.8 GB, given the 12B parameter model? Does the model use closer to 24 GB of VRAM or 12 GB? I've always assumed 1 billion parameters = 1 GB of VRAM. Is this estimate inaccurate?
Quick napkin math assuming bfloat16 format: 1B params * 16 bits = 16B bits = 2 GB.
Since it's a 12B parameter model, you get around 24 GB. Downcasting to bfloat16 from float32 comes with pretty minimal performance degradation, so we uploaded the weights in bfloat16 format.
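A quick way to sanity-check the napkin math (illustrative only; actual VRAM usage is higher because activations, the text encoders, and the VAE also need memory):

```python
# Weight memory = parameter count x bytes per parameter.
params = 12e9  # 12B parameters

for dtype, bytes_per_param in [("float32", 4), ("bfloat16 / fp16", 2), ("fp8", 1)]:
    print(f"{dtype}: ~{params * bytes_per_param / 1e9:.0f} GB")

# float32: ~48 GB, bfloat16/fp16: ~24 GB, fp8: ~12 GB
# The bfloat16 figure lines up with the ~23.8 GB safetensors file.
```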
A parameter can be any size float. Lots of downloadable models are FP8 (8 bits per parameter), but this model appears to be bfloat16 (16 bits per parameter).
Often, the training is done in FP16, then the weights are quantized down to FP8 or FP4 for distribution.
Describing it as "Octopus DJ with no fingers" got rid of the hands for me but, interestingly, it also removed every anthropomorphized element of the octopus, so that it was literally just an octopus spinning turntables.
I've never gotten one to make what I am thinking of:
A Galton board. At the top, several inches apart, are two holes from which balls drop. One drops blue balls, the other red balls. They form a merged distribution below in columns, demonstrating dual overlapping normal distributions.
Imagine one of these: https://imgur.com/a/DiAOTzJ but with two spouts at the top dropping different colored balls.
Its attempts: https://imgur.com/a/uecXDzI
We have not added a separate RTX-accelerated version for FLUX.1 Krea, but the model is fully compatible with the existing FLUX.1 dev codebase. I don't think we made a separate ONNX export for it, though. Doing a 4~8 bit quantized version with SVDQuant would be a nice follow-up so that the checkpoint is friendlier for consumer-grade hardware.
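For example, since the checkpoint follows the FLUX.1 dev conventions, the standard diffusers FluxPipeline path should load it as-is. A minimal sketch; the sampler settings below are illustrative, not official recommendations:

```python
import torch
from diffusers import FluxPipeline

# Gated repo: accept the license on Hugging Face and log in first.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Krea-dev",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # helps on GPUs with less than 24 GB of VRAM

image = pipe(
    prompt="a photo of an owl perched on a mossy branch at dawn",
    height=1024,
    width=1024,
    guidance_scale=4.5,          # illustrative value
    num_inference_steps=28,      # illustrative value
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("owl.png")
```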
I'd recommend you offer a clearly documented pathway for companies to license commercial usage rights for outputs if they get the results they seek (I'll know soon enough!)
The license, as I understand it, applies only to the model, not to the images produced? Otherwise they would have to respect the licenses of the images it was trained on.
I usually use https://github.com/axolotl-ai-cloud/axolotl on Lambda/Together for working with these types of models. Curious what others are using. What is the quickest way to get started? They mention pre-training and post-training but sadly didn't provide any reference starter scripts.
Amazing. I can practically smell that owl, it looks so darned owl-like.
From the article it doesn’t seem as though photorealism per se was a goal in training; was that just emergent from human preferences, or did it take some specific dataset construction mojo?
I love owls. Photorealism was one of the focus areas for training because the "AI look" (e.g. plastic skin) was the biggest complaint about the FLUX.1 model series. Photorealism was achieved through careful curation of both the finetuning and the preference datasets.
Cool to see an open weight model for this. But what's the business use case? Is it for people who want to put fake faces on their website that don't look AI generated?
From a business point of view, there are many use-cases. Here's a list in no particular order:
- You can quickly generate assets that can be used _alongside_ more traditional tools such as Adobe Photoshop, After Effects, or Maya/Blender/3ds Max. I've seen people creating diffuse maps for 3D using a mix of diffusion models and manual tweaking with Photoshop.
- Because this model is compatible with the FLUX architecture, we've also seen people personalizing the model to keep products or characters consistent across shots. This is useful in the e-commerce and fashion industries. We allow easy training on our website (we labeled it Krea 1) to do this, but the idea with this release is to encourage people with local rigs and more powerful GPUs to tweak it with LoRAs themselves too.
- Then I've seen fascinating use-cases such as UI/UX designers who prompt the model to create icons, illustrations, and sometimes even whole layouts that they then use as a reference (like Pinterest) to refine their designs in Figma. This reminds me of people who take a raster image and vectorize it manually with the pen tool in Adobe Illustrator.
We've also seen big companies using it for both internal presentations and external ads, across marketing teams and big agencies like Publicis.
EDIT: Then there's a more speculative use-case that I have in mind: Generating realistic pictures of food.
While some restaurants have people who illustrate their menu items and others have photographers, the long tail of restaurants does not have the means or expertise to do this. The idea, from the company's perspective, is to make it as easy as snapping a few pictures of all your dishes and turning your whole menu into a set of professional-looking pictures that accurately represent it.
Thank you! Glad you find it helpful.
The model is focused on photorealism, so it should be able to generate most realistic scenes. Although I think using 3D engines would be more suitable for typical robotics training cases, since they give you ground-truth data on objects, locations, etc.
One interesting use case would be if you are focusing on a robotics task that would require perception of realistic scenes.
We used two types of datasets for post-training: supervised finetuning data and preference data for the RLHF stage. You can actually use fewer than 1M samples to significantly boost the aesthetics. Quality matters A LOT. Quantity helps with generalization and stability of the checkpoints, though.
what does " designed to be compatible with FLUX architecture" mean and why is that important?
https://genai-showdown.specr.net
We prepared a blog post about how we trained FLUX Krea, if you're interested in learning more: https://www.krea.ai/blog/flux-krea-open-source-release
"Octopus DJ spinning the turntables at a rave."
The human-like hands the DJ sprouts are interesting, and no amount of prompting seems to stop them.
Opinionated, as the paper says.
- GitHub repository: https://github.com/krea-ai/flux-krea
- Model Technical Report: https://www.krea.ai/blog/flux-krea-open-source-release
- Huggingface model card: https://huggingface.co/black-forest-labs/FLUX.1-Krea-dev
In a nutshell, it follows the same license as the BFL FLUX.1-dev model.
- cost per image
- latency per image
Hope you guys can add it somewhere!
We wanted to keep this technical blog post free from marketing fluff, but maybe we overdid it.
However, sometimes it's hard to give an exact price per image, as it depends on resolution, number of steps, whether a LoRA is being used or not, etc.
Check this out: https://github.com/krea-ai/flux-krea
Let me see if we can add more details to the blog post, and thanks for the flag!
Does this have any application for generating realistic scenes for robotics training?
Cannot access gated repo for url https://huggingface.co/black-forest-labs/FLUX.1-Krea-dev/res.... Access to model black-forest-labs/FLUX.1-Krea-dev is restricted. You must have access to it and be authenticated to access it. Please log in.