Moebius: Tiny 0.22B Inpainting Model Matches FLUX.1-Fill-Dev with 15× Faster Inference

Jun 19, 2026

The Efficiency Breakthrough

On June 16, 2026, the HUSTVL lab at Huazhong University of Science and Technology released Moebius, a 0.22-billion-parameter image inpainting framework that directly challenges, and on multiple benchmarks surpasses, the 11.9-billion-parameter industrial model FLUX.1-Fill-Dev. According to the paper, listed on Hugging Face's daily papers feed, Moebius achieves this with less than 2% of the parameters and a more than 15× speedup in total inference time. The model is open source and had gathered 31 GitHub stars at the time of writing, an early sign of interest from developers frustrated by the compute costs of large generalist models.

Moebius is not a smaller student mimicking a larger teacher in a straightforward way. The team intentionally reconstructed the diffusion backbone to avoid the representation bottleneck that typically kills performance when compressing a model this aggressively. By designing parameter-efficient interaction blocks and a novel distillation strategy that operates entirely in latent space, they managed to preserve the complex interactions that normally require billions of extra parameters. The result is a specialist that the researchers claim “rivals or even surpasses” FLUX.1-Fill-Dev on standard inpainting benchmarks while running on consumer-grade hardware.

Technical Architecture: LλMI Block and Latent-Space Distillation

At the core of Moebius is the Local-λ Mix Interaction (LλMI) block, which replaces the bulk of the standard diffusion transformer with two compact modules: Local-λ and Interactive-λ. The Local-λ module summarizes spatial contexts into fixed-size linear matrices, while Interactive-λ integrates global semantic priors. Together, they preserve long-range dependencies without the quadratic cost of full self-attention, whose compute and memory grow with the square of the token count. This design helps bring the total parameter count down to 0.22B, compared with 11.9B for FLUX.1-Fill-Dev.
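To make the "fixed-size linear matrices" idea concrete, here is a minimal pure-Python sketch of a generic lambda-style summary. It is an illustration under stated assumptions, not the Moebius implementation: the function names, the dimensions K_DIM and V_DIM, and the choice of a softmax over positions are all hypothetical. The point it shows is that the summary matrix has the same shape no matter how many context positions are fed in.

```python
import math
import random


def softmax_over_positions(keys):
    """Normalize each key channel across all positions (column-wise softmax)."""
    n, k = len(keys), len(keys[0])
    out = [[0.0] * k for _ in range(n)]
    for c in range(k):
        column = [keys[i][c] for i in range(n)]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in column]
        total = sum(exps)
        for i in range(n):
            out[i][c] = exps[i] / total
    return out


def summarize_context(keys, values):
    """Collapse n context positions into one k x v matrix: softmax(K)^T V."""
    weights = softmax_over_positions(keys)
    n, k, v = len(keys), len(keys[0]), len(values[0])
    lam = [[0.0] * v for _ in range(k)]
    for i in range(n):
        for a in range(k):
            for b in range(v):
                lam[a][b] += weights[i][a] * values[i][b]
    return lam


def apply_lambda(queries, lam):
    """Each query reads the summary with one small matrix product (q^T lam)."""
    k, v = len(lam), len(lam[0])
    return [[sum(q[a] * lam[a][b] for a in range(k)) for b in range(v)]
            for q in queries]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
K_DIM, V_DIM = 4, 8  # hypothetical sizes, chosen only for illustration

lam_shapes = []
for n in (16, 256):  # two very different context lengths
    keys = random_matrix(n, K_DIM, rng)
    values = random_matrix(n, V_DIM, rng)
    queries = random_matrix(n, K_DIM, rng)
    lam = summarize_context(keys, values)
    outputs = apply_lambda(queries, lam)
    lam_shapes.append((len(lam), len(lam[0])))

print(lam_shapes)
```

In this sketch the cost of building the summary grows linearly with the number of positions, and no position-by-position attention map is ever materialized, which is the general property the article attributes to the λ modules.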

But squeezing a model that much usually degrades output quality. To counter this, the researchers introduced an adaptive multi-granularity distillation strategy. Critically, they kept the entire distillation process within the latent space, bypassing the expensive pixel-space decoding steps that similar works rely on. By dynamically balancing multiple gradient-based losses, including feature-level and output-level alignment, Moebius gradually learns to reproduce high-fidelity details. The paper explains that this approach avoids the common pitfall of “over-dominance” by a single loss term, adapting the weighting as training progresses. The result is a stable training recipe that unlocks the full capability of the compact architecture.
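The article does not spell out the weighting rule, so the following is only a rough sketch of one common way to keep a single loss term from dominating: weight each term inversely to its current magnitude. The function names are hypothetical, and using raw loss values as a stand-in for gradient magnitudes is an assumption made for simplicity; the paper's actual scheme may differ.

```python
def balance_weights(loss_values, eps=1e-8):
    """Return one weight per loss, inversely proportional to its magnitude.

    Weights are rescaled to sum to the number of losses, so the overall
    scale of the objective stays comparable to an unweighted sum.
    """
    inverse = [1.0 / (abs(value) + eps) for value in loss_values]
    scale = len(loss_values) / sum(inverse)
    return [w * scale for w in inverse]


def combined_loss(loss_values):
    """Weighted sum of the losses plus the weights that were used.

    In a real training loop the weights would be treated as constants
    (no gradient flows through them) and recomputed every few steps.
    """
    weights = balance_weights(loss_values)
    total = sum(w * v for w, v in zip(weights, loss_values))
    return total, weights


# Assumed example values: a large feature-level loss and a small
# output-level loss, as might occur early in training.
feature_loss, output_loss = 10.0, 0.1
total, weights = combined_loss([feature_loss, output_loss])
contributions = [w * v for w, v in zip(weights, [feature_loss, output_loss])]
print(weights, contributions, total)
```

With these inputs the two weighted contributions come out nearly equal, even though the raw losses differ by a factor of 100, which is the behavior the "over-dominance" remark is concerned with.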

Real-World Performance vs. FLUX.1-Fill-Dev

The paper presents extensive experiments on natural and portrait datasets to back up these claims. Moebius not only matches the 10B-level generalist on quantitative metrics but, in some cases, also produces visually sharper results where FLUX.1-Fill-Dev struggles with fine texture reconstruction. Without heavy post-processing or iterative refinement, Moebius generates an inpainted image in a fraction of the time. The reported >15× speedup refers to total inference time, meaning the entire pipeline, from input to final output, runs more than 15 times faster.
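For readers who want to check an end-to-end speedup figure on their own hardware, a minimal timing harness might look like the sketch below. The two pipeline functions are hypothetical stand-ins that simply sleep; they are not the Moebius or FLUX.1-Fill-Dev APIs and should be replaced with real calls covering everything from input to final image.

```python
import statistics
import time


def median_latency(run_once, repeats=5, warmup=1):
    """Median wall-clock seconds for one end-to-end call of run_once()."""
    for _ in range(warmup):
        run_once()  # warm-up runs are discarded (caches, lazy initialization)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Hypothetical stand-ins for two inpainting pipelines.
def fake_large_pipeline():
    time.sleep(0.030)


def fake_small_pipeline():
    time.sleep(0.002)


if __name__ == "__main__":
    large = median_latency(fake_large_pipeline)
    small = median_latency(fake_small_pipeline)
    print(f"large: {large:.4f}s  small: {small:.4f}s  "
          f"speedup: {speedup(large, small):.1f}x")
```

When timing GPU code, the work is typically asynchronous, so each timed call needs to wait for the device to finish (for example with torch.cuda.synchronize() in PyTorch) before the clock is read.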

For developers, the practical benefit is substantial. Running FLUX.1-Fill-Dev on a single high-end GPU can take several seconds per image; Moebius brings that down to well under a second. With only 0.22B parameters, the FP16 weights occupy roughly 0.44 GB (0.22B parameters × 2 bytes each), which puts on-device and real-time inpainting within reach. The GitHub repository (github.com/hustvl/Moebius) provides a plug-and-play pipeline that supports both prompt-guided and mask-guided inpainting, aiming to lower the barrier for those who previously had to rent cloud A100s just to test high-quality completion models.
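The footprint and parameter-ratio figures quoted in this article follow from simple arithmetic, reproduced below. The parameter counts are the ones given in the article; the helper name is made up for this sketch, and the estimate covers model weights only, not activations or any auxiliary components such as encoders or decoders.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

MOEBIUS_PARAMS = 0.22e9     # 0.22B, as reported in the article
FLUX_FILL_PARAMS = 11.9e9   # 11.9B, as reported in the article


def weight_footprint_gb(num_params, dtype="fp16"):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


param_ratio = MOEBIUS_PARAMS / FLUX_FILL_PARAMS
moebius_fp16_gb = weight_footprint_gb(MOEBIUS_PARAMS)
flux_fp16_gb = weight_footprint_gb(FLUX_FILL_PARAMS)

print(f"parameter ratio: {param_ratio:.2%}")        # 1.85%
print(f"Moebius FP16 weights: {moebius_fp16_gb:.2f} GB")   # 0.44 GB
print(f"FLUX.1-Fill-Dev FP16 weights: {flux_fp16_gb:.2f} GB")  # 23.80 GB
```

The ratio of about 1.85% is consistent with the "less than 2% of the parameters" claim, and the same calculation puts the larger model's FP16 weights at roughly 23.8 GB.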

Implications for On-Device Inpainting and the Specialist Model Trend

Moebius strengthens a growing industry narrative: generalist foundation models may not be the most efficient path for every task. Image inpainting is a highly specific function — removing objects, filling holes, extending canvases — and a purpose-built specialist can deliver equal or better quality at a fraction of the cost. With sensor-rich mobile devices and AR glasses becoming mainstream, on-device inpainting that runs in real time without sending data to the cloud becomes a competitive advantage. Moebius’s tiny footprint means it could soon appear in photo-editing apps, video post-processing tools, and even browser-based demos where a multi-gigabyte model would be out of the question.

The competitive angle with Black Forest Labs’ FLUX.1-Fill-Dev is also notable. While FLUX.1-Fill-Dev is a versatile tool trained on massive datasets, its size limits adoption outside well-funded cloud environments. Moebius demonstrates that task-specific research can close the gap without requiring the trillion-parameter trajectory. The paper does not claim to beat generalist models on all metrics — only on inpainting benchmarks — but the mere fact that a 0.22B specialist can stand toe-to-toe with a 12B generalist should make commercial model providers pay attention.

What to Watch Next

The HUSTVL team has released the code and pretrained weights, and early adopters are already experimenting with fine-tuning Moebius on custom domains. Given the adaptive multi-granularity distillation recipe, further compression below 0.1B parameters may be possible without major quality drops. If the approach generalizes to other dense prediction tasks like depth estimation or super-resolution, the same blueprint could spawn a family of ultra-lightweight specialists that collectively replace a single monolithic model. For now, Moebius sets a new efficiency benchmark in high-fidelity inpainting, and with FLUX.1-Fill-Dev as the reference point, the message is clear: you don’t need 12 billion parameters to erase an unwanted object from a photo.

Source: HuggingFace Papers

345tool Editorial Team

We are a team of AI technology enthusiasts and researchers dedicated to discovering, testing, and reviewing the latest AI tools to help users find the right solutions for their needs.
