
The Benchmark That Matters for AI-Generated 3D Models
ModelRift's Antigravity 2.0 has claimed the top spot on the OpenSCAD Architectural 3D LLM Benchmark, a specialized test of how well large language models translate natural-language architectural descriptions into working OpenSCAD code. The benchmark consists of more than 1,000 structured prompts, ranging from simple geometric shapes to complex multi-room layouts. It scores both the syntactic correctness of the generated code and the semantic fidelity of the resulting 3D model, that is, whether the geometry actually matches the description. According to the benchmark's public leaderboard, Antigravity 2.0 outperformed all prior submissions, including fine-tuned variants of GPT-4 and Claude, by a statistically significant margin. The announcement drew 406 upvotes on Hacker News within a day of being posted, and the result marks a notable step forward in the practical utility of LLMs for 3D design automation.
What Makes Antigravity 2.0 Different

Antigravity 2.0 is built on a custom architecture that pairs a large language model backbone with a dedicated geometric correctness layer. Earlier models often produce OpenSCAD code that is syntactically valid but ignores spatial constraints such as wall thickness or window alignment; Antigravity 2.0 enforces these rules explicitly through this hybrid design. ModelRift's engineers trained the model on a curated dataset of more than 500,000 architectural OpenSCAD scripts, augmented with natural-language annotations that describe design intent. The dataset, which the company has partially open-sourced, covers residential, commercial, and abstract forms. In benchmark evaluations, Antigravity 2.0 achieved a 92% pass rate on structural integrity tests, compared with 78% for the next-best model, and reduced the average number of syntax errors per script by 40%. These figures matter most to developers who rely on OpenSCAD for rapid prototyping in 3D printing and CAD workflows.
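ModelRift has not published how its geometric correctness layer works. As a rough illustration of the kind of rule such a layer might enforce, the sketch below validates a hypothetical wall-and-window specification before emitting OpenSCAD code for it. The `Wall` and `Window` types, the `violations` and `wall_with_window_scad` helpers, and the 0.2 m minimum thickness are all illustrative assumptions, not part of Antigravity 2.0.

```python
from dataclasses import dataclass

# Hypothetical rule for illustration only; not ModelRift's actual threshold.
MIN_WALL_THICKNESS = 0.2  # metres


@dataclass
class Wall:
    length: float     # extent along the x axis
    thickness: float  # extent along the y axis
    height: float     # extent along the z axis


@dataclass
class Window:
    offset: float  # distance from the wall's start to the window's left edge
    sill: float    # height of the window's bottom edge above the floor
    width: float
    height: float


def violations(wall: Wall, window: Window) -> list[str]:
    """Return a description of every spatial rule the design breaks."""
    problems = []
    if wall.thickness < MIN_WALL_THICKNESS:
        problems.append(f"wall thickness {wall.thickness} is below {MIN_WALL_THICKNESS}")
    if window.offset < 0 or window.offset + window.width > wall.length:
        problems.append("window extends past the ends of the wall")
    if window.sill < 0 or window.sill + window.height > wall.height:
        problems.append("window extends below the floor or above the wall")
    return problems


def wall_with_window_scad(wall: Wall, window: Window) -> str:
    """Emit OpenSCAD source for a wall with a rectangular opening, or raise ValueError."""
    problems = violations(wall, window)
    if problems:
        raise ValueError("; ".join(problems))
    eps = 0.01  # the cut overshoots both wall faces so no zero-thickness skin remains
    return (
        "difference() {\n"
        f"  cube([{wall.length}, {wall.thickness}, {wall.height}]);\n"
        f"  translate([{window.offset}, {-eps}, {window.sill}])\n"
        f"    cube([{window.width}, {wall.thickness + 2 * eps}, {window.height}]);\n"
        "}\n"
    )


if __name__ == "__main__":
    print(wall_with_window_scad(Wall(4.0, 0.25, 2.7), Window(1.5, 0.9, 1.0, 1.2)))
```

The small `eps` overshoot is a common OpenSCAD habit: if the cutting box ended exactly flush with the wall faces, the two solids would share coincident faces, which tends to cause rendering artifacts.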
Broader Implications for AI-Driven Design
The OpenSCAD Architectural 3D LLM Benchmark was created to fill a gap in existing AI evaluation suites: most benchmarks focus on general-purpose code generation (e.g., HumanEval) or text-to-image tasks, and few measure the ability to produce precise, physically plausible 3D models from natural language. OpenSCAD, a script-based CAD tool popular in the maker and hobbyist communities, requires generated code not only to run but to render into a valid solid. That is a much stricter requirement than typical code generation, where passing unit tests is usually enough. By excelling in this niche, Antigravity 2.0 demonstrates that LLMs can be fine-tuned for domains where spatial reasoning and domain-specific constraints are critical. This has direct implications for architects exploring generative design, educators creating interactive 3D learning materials, and companies automating the generation of 3D-printable parts from verbal descriptions.
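The requirement that generated code must render into a solid can be checked mechanically. The following is a minimal sketch, assuming the `openscad` command-line program is installed and on `PATH`: it writes the generated source to a temporary file, asks OpenSCAD to export an STL with `-o`, and treats a zero exit status plus a non-empty output file as success. The helper name `renders_to_solid` is hypothetical, and this is not the benchmark's actual harness, which would also need semantic checks against the prompt.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def renders_to_solid(scad_source: str, timeout_s: float = 120.0) -> bool:
    """Return True if OpenSCAD exports the source to a non-empty STL file.

    Hypothetical helper for illustration; requires the `openscad` CLI on PATH.
    """
    exe = shutil.which("openscad")
    if exe is None:
        raise RuntimeError("the openscad executable was not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "model.scad"
        out = Path(tmp) / "model.stl"
        src.write_text(scad_source)
        try:
            result = subprocess.run(
                [exe, "-o", str(out), str(src)],  # export format inferred from .stl
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # treat a render that never finishes as a failure
        return result.returncode == 0 and out.exists() and out.stat().st_size > 0
```

A successful export is necessary but not sufficient: depending on the OpenSCAD version, some geometry problems surface only as warnings, so a stricter harness would also inspect the exported mesh itself.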

Comparing Antigravity 2.0 to Alternatives
The benchmark lists several other notable entries, including an OpenSCAD-specialized LoRA variant of Llama 3.3 and a proprietary model from a stealth startup. Antigravity 2.0's lead comes primarily from better handling of compound instructions, which often trip up models that treat each clause independently. A prompt such as "a two-story house with a balcony on the south side and a flat roof" requires placing the balcony and the roof relative to the house rather than modeling each in isolation. ModelRift attributes this to a hierarchical tokenization scheme that preserves the spatial relationships between rooms and structural elements.
Antigravity 2.0 is available both as an API, priced at $0.03 per 1,000 tokens (comparable to GPT-4), and as an open-weight release for research use, a move that has drawn praise from the open-source community.
The model is not without limitations: its training dataset skews toward Western architectural styles, and it occasionally produces non-manifold geometry (surfaces that do not cleanly enclose a solid volume, which slicers and CAD tools may reject) for very complex prompts. ModelRift has acknowledged these shortcomings and plans to release version 2.1 with expanded cultural diversity validation.
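To make "non-manifold" concrete: a closed triangle mesh is edge-manifold when every edge is shared by exactly two triangles. The sketch below, a simplified check rather than the benchmark's validator, counts how many faces use each undirected edge of an indexed triangle mesh and reports the edges that break that rule. It deliberately ignores other defects, such as two surfaces touching at a single vertex, and the function name `non_manifold_edges` is illustrative.

```python
from collections import Counter
from typing import Iterable


def non_manifold_edges(triangles: Iterable[tuple[int, int, int]]) -> list[tuple[int, int]]:
    """Return, sorted, every undirected edge not shared by exactly two triangles.

    An edge used once lies on a hole (open boundary); an edge used three or
    more times is a non-manifold junction, such as a fin sticking out of a surface.
    """
    counts = Counter()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1  # ignore edge direction
    return sorted(edge for edge, n in counts.items() if n != 2)


if __name__ == "__main__":
    tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (0, 2, 3)]
    print(non_manifold_edges(tetrahedron))       # [] : closed and manifold
    print(non_manifold_edges(tetrahedron[:-1]))  # [(0, 2), (0, 3), (2, 3)] : one face missing
```

Counting undirected edges also misses inconsistently oriented faces; requiring each directed edge to appear exactly once would catch those as well.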
What to Watch Next
The rapid ascent of Antigravity 2.0 suggests that domain-specific LLM benchmarks are becoming essential tools for measuring real-world capability. For developers and architects, the model makes it easier to generate editable, script-based 3D models without writing the code by hand. The open-weight release also lowers the barrier to further fine-tuning on niche architectural traditions or to integration into existing CAD pipelines. Expect similar benchmarks to emerge for other programmatic modeling environments, such as FreeCAD's Python API or Blender's geometry nodes. For now, Antigravity 2.0 offers a concrete example of LLMs moving beyond chatbot conversations into precise, production-ready design, a shift that could reshape how we think about AI-aided architecture.