
Qwen models are some of the best models I’ve used, period. I’ve been using them for a few years and generally when I see an option to use a Qwen model, I’ll pick that.

Qwen's releases have mostly been image-to-text models (which they've excelled at, along with InternVL, which I believe is a fine-tune of Qwen). Qwen-Image is their first text-to-image (T2I) model (at least that I'm aware of).

I generally test each diffusion model quickly when it comes out because I need much better prompt following and quality for other projects I'm working on (an itch not scratched so far). Flux.1 Dev, HiDream, Cosmos, SD3: all pretty good. Noting that Stable Diffusion 1.4 came out in August of 2022, three years ago, I think we've come a long way. Qwen-Image uses one text encoder (TE), as opposed to at least two for most other models and four for HiDream (!?!). Turns out you only need one really good one.

Qwen-Image. Not bad for a first inference, though, and definitely no Telly in the Qwen datasets. Plasticine porters with looking glass ties are still here, just like Flux out of the box.

Naturally, Qwen uses their own text encoder (loaded in ComfyUI's CLIP slot), which is Qwen2.5-VL 7B (I'm already using that one as a TE in ComfyUI; a fantastic model). The VAE is also theirs. I'm not surprised that their text handling is above par, because the makers of Qwen are incredibly good at training I2T models.
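For reference, here's roughly what running the base model looks like outside ComfyUI. This is a minimal sketch, assuming a recent diffusers build that ships Qwen-Image support; the generation settings are illustrative.

```python
# Minimal sketch: Qwen-Image via diffusers, which wires up the single
# Qwen2.5-VL text encoder and Qwen's own VAE for you.
# Assumes a diffusers version with Qwen-Image support.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt = (
    "A vibrant scene: cats walking down an old street in Amsterdam, "
    'text at the bottom reads "Cats in Clogs"'
)
image = pipe(prompt=prompt, num_inference_steps=30).images[0]
image.save("qwen_image_test.png")
```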

Let’s try something hard.

“list ten font names and the text from each font you list should be in that font. e.g. if times roman the text should be written in times roman font.”
OK, well, progress to be made either way… I'm not using BF16, so perhaps that would help a bit (probably not enough).

Ok, let’s try again. Lets use a stronger Qwen model to generate the prompt to feed it back to itself (QW3-A30b-3b) (thinking model) of which the 2.5VL 7B CLIP will handle it. The input text to the model was “expand upon creatively for a diffusion model prompt: “A vibrant scene: cats walking down an old street in Amsterdam with traditional Dutch pavers, each wearing miniature wooden clogs. Text at the bottom reads “Cats in Clogs”, Qwen 3’s result below:

“A hyper-detailed, impressionistic scene bathed in golden-hour light: A narrow, cobblestone street in Amsterdam, its weathered pavers worn smooth by centuries of foot traffic. The atmosphere hums with life—dutch-style canal boats glide silently past, their reflections shimmering in the water. Cats, each adorned with intricately carved wooden clogs (klompen) in vibrant hues of red, blue, and gold, stride confidently down the street. Their fur glows under the warm light, with some cats pausing to sniff blooming tulips in flower boxes or leaping over puddles that mirror the sky’s soft pastel tones. A gondolier leans lazily on his oar, while a street musician plays a lively accordion nearby. The scene pulses with movement: pigeons flutter overhead, bicycles clatter across the pavers, and a faint mist rises from the canal. At the bottom of the image, bold, hand-painted Dutch typography in white with a black outline reads ‘Cats in Clogs’, set against a darkened background for contrast. In the far distance, a giant letter ‘C’ made of catnip plants looms faintly, adding a whimsical touch to the scene.”
Key Enhancements:
Dynamic lighting: Golden-hour glow adds warmth and depth.
Historic authenticity: Cobblestones, canal boats, and tulips anchor the scene in Amsterdam’s culture.
Cats’ clogs: Detailed, colorful klompen with cultural context.
Text clarity: Bold white font with black outline ensures readability against complex backgrounds.
Whimsical elements: Catnip ‘C’ and playful cat interactions elevate creativity without overwhelming the composition.”

Pretty good.
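For anyone who wants to script that expansion step, here's a rough sketch using the Ollama Python client with a Qwen3 model; the model tag is illustrative, and any strong Qwen3 variant should do.

```python
# Rough sketch of the prompt-expansion step using the Ollama Python client.
# The model tag below is illustrative.
import ollama

seed = (
    "A vibrant scene: cats walking down an old street in Amsterdam with "
    "traditional Dutch pavers, each wearing miniature wooden clogs. "
    'Text at the bottom reads "Cats in Clogs"'
)

response = ollama.chat(
    model="qwen3:30b-a3b",
    messages=[{
        "role": "user",
        "content": f"expand upon creatively for a diffusion model prompt: {seed}",
    }],
)

expanded_prompt = response["message"]["content"]
print(expanded_prompt)  # feed this back into the Qwen-Image workflow
```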

08.11

Did one additional test this morning, to exercise both the encoder and the model. I used a distilled LoRA because, even with a few 4090s, the inference time for 20-30 steps was minutes per image. Kijai made a "lightning" LoRA in a matter of hours, it seems, which dropped the steps to 8-10 with impressive results, and after some model-strength tuning the output was practically identical to the base model output without a distilled LoRA. As I asked on HF, does this guy sleep? Anyway, with 8 steps making testing a matter of hours rather than days, we ran it:
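As a rough sketch of what the distilled run looks like outside ComfyUI, assuming diffusers' standard LoRA loading works for this checkpoint: the LoRA path, adapter weight, and settings below are placeholders to tune, not the exact setup used here.

```python
# Sketch of an 8-step run with a lightning-style distillation LoRA.
# Path, adapter weight, and prompt are placeholders.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights(
    "path/to/qwen_image_lightning_lora.safetensors",  # placeholder path
    adapter_name="lightning",
)
pipe.set_adapters(["lightning"], adapter_weights=[1.0])  # the "model strength" knob
pipe.to("cuda")

image = pipe(
    prompt="cats in wooden clogs on an Amsterdam street",
    num_inference_steps=8,  # down from 20-30 with the base model
).images[0]
image.save("lightning_8_steps.png")
```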

On the left is the image generated from the prompt only (with some grain added to "de-plasticine" the output). On the right is the source image that was fed into Ollama, using an abliterated version of Qwen2.5-VL-7B, to produce that prompt. This is a pretty good result on all fronts (base model, LoRA, and TE).
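The image-to-prompt side is the same kind of one-liner against Ollama; a sketch, with an illustrative model tag standing in for the abliterated Qwen2.5-VL-7B build:

```python
# Sketch: caption a source image with a Qwen2.5-VL model served by Ollama,
# then reuse the caption as a diffusion prompt. Model tag and path are illustrative.
import ollama

response = ollama.chat(
    model="qwen2.5vl:7b",
    messages=[{
        "role": "user",
        "content": "Describe this image in detail as a text-to-image prompt.",
        "images": ["source_image.png"],  # placeholder path
    }],
)

prompt_from_image = response["message"]["content"]
```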

Pretty good for their first effort. Kijai too.

08.21

The Qwen-Image-Edit model was released a few days ago. Long term, these sorts of models may dramatically reduce the need for custom LoRAs.