The Use Case — "It just works, first try"
The killer use case isn't code autocomplete. It's this: a solo indie developer opens a blank Godot 4 project, types a game idea in plain English, and Susadhya-GD-100M generates a full, importable, zero-error Godot project — complete multi-file scene trees, proper autoloads, signal wiring, cross-script references, resource paths, and a project.godot — that compiles on first import.
GPT-4 sees GDScript as "Python-ish." It hallucinates @export_range arguments, invents non-existent Node classes (KinematicBody2D in Godot 4), wires signals to methods that don't exist in the target script, and can't maintain consistent res://path references across files it generated. A zero-error Godot import requires ALL of these to be correct simultaneously.
The GDScript grammar is small and regular. The valid API surface of Godot 4 is finite (~800 classes, ~5k methods). A model trained entirely on real Godot 4 repos with AST-graph annotations will have near-perfect grammar adherence and cross-file consistency — the specific failure modes of large models — even at 100M params.
Target User Scenarios
"Generate a Flappy Bird clone for me"
Planner model decomposes → Susadhya writes: Main.tscn, Bird.gd, Pipe.gd, GameManager.gd, project.godot, autoload/Score.gd. Each file references the others correctly. User clicks File → Import and it just runs.
"Add a health system and death screen to my existing project"
Susadhya reads the existing project graph (via AST tool), identifies the Player node and existing signals, generates a HealthComponent.gd that connects to the existing Player.gd via the correct signal names — no hallucinated method calls.
VSCode / Godot plugin: inline repo-aware completion
As you type in Enemy.gd, Susadhya knows that Player.gd in your project has a take_damage(amount: int) method and health_changed signal — auto-completes cross-file references correctly, offline, in <40ms on the user's M-series Mac.
Model Architecture — MoE-MLA Hybrid
Standard Llama-style is the safe choice. But given the latest research (Mehta et al., 2025: "MoE-MLA-RoPE achieves 68% KV cache reduction and 3.2× inference speedup while using 42% fewer active parameters per forward pass"), Susadhya adopts a lightweight MoE + MLA architecture that is more efficient at inference while matching or beating a dense baseline in perplexity.
Phase 1 — Dense Baseline (Weeks 1–10)
| Hyperparameter | Value | Rationale |
|---|---|---|
| Layers | 12 | Standard depth for a ~100M-class model |
| Hidden size (d_model) | 768 | Fits in M4 Pro 24GB with seq=4096 |
| Attention heads (Q) | 12 | — |
| KV heads (GQA) | 4 | 3:1 ratio, ~67% KV cache reduction vs MHA |
| FFN hidden | 2048 | SwiGLU with gating; 2048 ≈ 2.7 × d_model (the usual 8/3 ratio) |
| Context length | 4096 | Fits full small Godot project; extend to 16k in repo-stage |
| Vocabulary | 16,384 | Custom BPE on GDScript+Python+English |
| Positional encoding | RoPE (θ=500k) | Llama 3 base for better length extension |
| Normalization | RMSNorm, pre-norm | Standard; faster than LayerNorm |
| Tied embeddings | Yes | Saves ~12M params |
| Total params | ~97M | Round to "100M" for branding |
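
As a cross-check on the table, the parameter count can be tallied from the listed hyperparameters. The sketch below assumes a bias-free Llama-style block (Q/K/V/O projections, a three-matrix SwiGLU FFN, two RMSNorm gains per layer) and a head dimension of d_model / heads; none of these details are spelled out in the table. Under those assumptions it lands near 88M with tied embeddings, so the ~97M figure presumably reflects choices not listed here.

```python
def dense_param_count(layers=12, d_model=768, q_heads=12, kv_heads=4,
                      ffn_hidden=2048, vocab=16_384, tied=True):
    """Approximate parameter count for a bias-free Llama-style decoder."""
    head_dim = d_model // q_heads
    attn = 2 * d_model * d_model                 # Q and output projections
    attn += 2 * d_model * (kv_heads * head_dim)  # K and V projections (GQA)
    ffn = 3 * d_model * ffn_hidden               # SwiGLU: gate, up, down
    norms = 2 * d_model                          # two RMSNorm gains per layer
    embed = vocab * d_model                      # token embedding table
    total = layers * (attn + ffn + norms) + embed + d_model  # + final norm
    if not tied:
        total += embed                           # separate output head
    return total


print(f"tied:   {dense_param_count() / 1e6:.1f}M")
print(f"untied: {dense_param_count(tied=False) / 1e6:.1f}M")
```

The difference between the two printed values is the embedding table itself (16,384 × 768 ≈ 12.6M), which matches the "saves ~12M params" row.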
Phase 2 — MoE-MLA Upgrade (optional, post-pretraining)
Replace dense FFN layers with sparse MoE (8 experts, top-2 routing) and replace GQA attention with MLA (Multi-Head Latent Attention, DeepSeek-V2 style). This converts the model from 97M dense params to ~200M total params / ~60M active params per token — achieving higher effective capacity at lower inference cost.
MLA compresses the key and value tensors into a low-dimensional latent before caching; at inference time the latents are projected back up. Result: 68% KV cache reduction vs GQA, with better modeling performance. KV cache for a 4096-token context: MHA=288MB → GQA=96MB → MLA=31MB. Critical for local deployment.
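
The cache figures can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes 12 layers, a head dimension of 64, and 4 bytes per cached value (the assumption that yields 288MB for MHA; a bf16 cache would be half of each number). The MLA figure simply applies the quoted 68% reduction to the GQA size rather than modeling the latent dimension.

```python
def kv_cache_mib(tokens=4096, layers=12, kv_heads=12, head_dim=64,
                 bytes_per_value=4):
    """KV cache size in MiB: one key and one value vector per layer and token."""
    values = 2 * layers * tokens * kv_heads * head_dim
    return values * bytes_per_value / 2**20


mha = kv_cache_mib(kv_heads=12)   # every query head has its own K/V
gqa = kv_cache_mib(kv_heads=4)    # 12 query heads share 4 K/V heads
mla = gqa * (1 - 0.68)            # quoted 68% reduction vs GQA

print(f"MHA={mha:.0f}MB  GQA={gqa:.0f}MB  MLA={mla:.0f}MB")
```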
MoE uses 8 micro-experts (the full DeepSeek recipe uses 64, but 8 is viable at 100M scale) with top-2 routing, for 42% fewer FLOPs per forward pass than the dense equivalent. Two shared, always-on experts handle common GDScript patterns; routed experts specialize (e.g. signal patterns, physics, UI nodes).
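
To make the routing concrete, here is a minimal per-token sketch. It assumes the 8 experts split into 2 shared and 6 routed (the text does not say whether the shared experts count toward the 8), and that the two selected routed experts have their softmax weights renormalized to sum to 1. Both are illustrative choices, not a confirmed design.

```python
import math


def route_top2(router_logits, num_shared=2, top_k=2):
    """Return {expert_index: weight} for one token.

    The first `num_shared` experts are always active with weight 1.0.
    The remaining experts compete through a softmax; only the top_k
    survive, and their weights are renormalized to sum to 1.
    """
    routed = router_logits[num_shared:]
    peak = max(routed)                                  # for numerical stability
    exps = [math.exp(v - peak) for v in routed]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    norm = sum(probs[i] for i in best)
    weights = {i: 1.0 for i in range(num_shared)}
    weights.update({num_shared + i: probs[i] / norm for i in best})
    return weights


weights = route_top2([0.0, 0.0, 1.0, 3.0, 0.5, 2.0, -1.0, 0.0])
print(weights)
```

Only the experts present in the returned mapping run for this token, which is where the saving in active parameters comes from.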
Quantization-Aware Training — bf16 → 1.58-bit
The BitNet b1.58 strategy (Ma et al., 2024; Nielsen & Schneider-Kamp, 2024) trains weights constrained to {−1, 0, +1} — ternary values. The latest research confirms a "16-to-1.58" staged strategy: pretrain in standard bf16 first, then transition to ternary QAT in a second phase. This achieves near full-precision quality with only a 2–3 point aggregate drop, while enabling massive inference-time savings.
The Training Strategy
Phase A — Full bf16 Pretraining (Days 1–10)
Standard bf16 pretraining on 2B tokens of GDScript + Python + English using AdamW. This trains the "shadow weights" — high-precision floats that capture the model's real learned values. Normal loss convergence, normal training — everything the YouTube series teaches as "step 1."
Phase B — Ternary QAT Transition (Days 10–14)
Replace all linear layers with BitLinear modules. On each forward pass, the shadow weights are quantized to {−1, 0, +1} with the absmean rule from BitNet b1.58: divide W by mean(|W|), round to the nearest integer, and clip to [−1, +1]. (A plain sign() would give only two levels, not three.) Activations are quantized to int8 via per-token absmax scaling. Gradients flow back through the Straight-Through Estimator (STE), which treats the gradient of the quantizer as 1. Only the shadow weights are updated during training; inference uses only the 1.58-bit weights. LR is decayed to 10% of peak for this phase.
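
A small worked example may help show what ternary quantization does to a weight vector. The sketch below uses the absmean round-and-clip rule described in the BitNet b1.58 paper, in plain Python on a toy list of weights; the function name and the sample values are illustrative only.

```python
def ternary_quantize(weights, eps=1e-8):
    """Absmean quantizer: scale by mean |w|, round, then clip to [-1, 1]."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


codes, scale = ternary_quantize([0.9, -0.05, -1.2, 0.3])
print(codes, round(scale, 4))   # small weights collapse to 0, large ones to +/-1
```

At inference time only `codes` and the single `scale` value need to be stored per weight matrix.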
Phase C — SFT + DPO in 1.58-bit (Days 14–17)
Fine-tune the already-ternary model on the synthetic (instruction, GDScript project) pairs. Because SFT is lower-LR and shorter, the QAT shadow weight trick still works. DPO runs after SFT. The shipped model is 1.58-bit throughout — no separate quantization step needed.
Key Implementation Details
import torch
import torch.nn as nn
import torch.nn.functional as F

# BitLinear layer: drop-in replacement for a bias-free nn.Linear
class BitLinear(nn.Module):
    def __init__(self, in_f, out_f):
        super().__init__()
        # Shadow weights, kept in high precision and updated by the optimizer
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.02)

    def forward(self, x):
        # Quantize weights to {-1, 0, +1}: absmean scale, then round and clip
        w = self.weight
        w_scale = w.abs().mean().clamp(min=1e-8)
        w_quant = (w / w_scale).round().clamp(-1, 1) * w_scale
        w_quant = w + (w_quant - w).detach()  # STE: backward sees identity
        # Quantize activations to int8 (absmax per token)
        x_scale = x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-8) / 127
        x_quant = (x / x_scale).round().clamp(-128, 127) * x_scale
        x_quant = x + (x_quant - x).detach()  # STE for the rounding step
        return F.linear(x_quant, w_quant)

# In MLX: wrap the forward pass with mlx.core.compile() for speed
# Transition: call swap_linear_to_bitlinear(model) after Phase A
| Format | Size | Tok/s (M4) |
|---|---|---|
| bf16 | 200 MB | ~25 tok/s |
| 4-bit GGUF | 50 MB | ~80 tok/s |
| 1.58-bit | ~19 MB | ~200+ tok/s |
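
The sizes in the table follow from parameter count times bits per weight. A quick check, assuming ~97M parameters and ignoring file metadata, quantization scales, and any tensors kept at higher precision:

```python
def weights_mb(params, bits_per_weight):
    """Weight storage in MB (10^6 bytes) for a given bit width."""
    return params * bits_per_weight / 8 / 1e6


for name, bits in [("bf16", 16), ("4-bit", 4), ("1.58-bit", 1.58)]:
    print(f"{name:>8}: {weights_mb(97e6, bits):.1f} MB")
```

The throughput column cannot be derived this way; it has to be measured on the target machine.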
BitNet b1.58 Reloaded (Nielsen 2024) found that at <100M params, doubling hidden size compensates for ternary precision loss. This is why the Phase 2 MoE upgrade (same compute, more effective capacity) pairs so well with QAT. If QAT causes >5 ppl spike on GDScript eval, revert to 4-bit GGUF via llama.cpp — still usable and far simpler.
Repo-Graph Training Format — The Core Innovation
The reason frontier models fail at multi-file Godot projects is missing cross-file context during training. Every file was trained in isolation. Susadhya solves this by serializing entire Godot repos as dependency-ordered, AST-annotated documents — so the model learns that Enemy.gd imports Player.gd and that calling player.take_damage() requires the method to exist in the adjacent file.
The Serialization Format — "Repo Document"
### REPO: platformer_game | 12 files | 847 tokens ###
### MANIFEST ###
project.godot · res://scenes/Main.tscn · res://scripts/Player.gd
res://scripts/Enemy.gd · res://scripts/GameManager.gd · autoloads=[ScoreManager]
### GRAPH ###
Player.gd
  → defines: PlayerController [CharacterBody2D]
  → signals: health_changed(hp:int), died()
  → methods: take_damage(amt:int), heal(amt:int), _physics_process()
Enemy.gd
  → defines: EnemyBase [CharacterBody2D]
  → calls: player.take_damage(amt) # cross-file ref
  → imports: res://scripts/Player.gd
GameManager.gd
  → connects: Player.died → _on_player_died()
  → autoloads: ScoreManager
### FILE: res://scripts/Player.gd ###
class_name PlayerController
extends CharacterBody2D

signal health_changed(hp: int)
signal died

@export var max_health: int = 100
var health: int

func take_damage(amt: int) -> void:
    health -= amt
    health_changed.emit(health)
    if health <= 0:
        died.emit()
### FILE: res://scripts/Enemy.gd ###
extends CharacterBody2D
# [XREF: PlayerController.take_damage]
var player: PlayerController

func attack() -> void:
    if player:
        player.take_damage(10)
### END REPO ###
# [XREF: ...] annotations act as soft attention anchors. This is the same insight as CoCoMIC (ICLR 2024) and RepoGraph (ICLR 2025) — explicitly encoding dependency structure into the training input causes models to learn cross-file attention patterns rather than treating each file independently.
How to Build It — The AST Pipeline
Parse each .gd file with tree-sitter-gdscript
Extract: class_name, extends, signal definitions, method signatures, @export vars, and all method call expressions. Build a per-file symbol table. This is deterministic and cheap — runs in seconds over thousands of repos.
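
As a rough illustration of the symbol table this step produces, the sketch below uses regular expressions as a stand-in for tree-sitter-gdscript. That substitution is an assumption made to keep the example dependency-free; a real pipeline would walk the parse tree, and the dictionary layout here is hypothetical.

```python
import re


def extract_symbols(source):
    """Collect top-level declarations and dotted call sites from one .gd file."""
    def first(pattern):
        match = re.search(pattern, source, re.MULTILINE)
        return match.group(1) if match else None

    return {
        "class_name": first(r"^class_name\s+(\w+)"),
        "extends": first(r"^extends\s+(\w+)"),
        "signals": re.findall(r"^signal\s+(\w+)", source, re.MULTILINE),
        "methods": re.findall(r"^func\s+(\w+)\s*\(", source, re.MULTILINE),
        "exports": re.findall(r"^@export\w*(?:\([^)]*\))?\s+var\s+(\w+)",
                              source, re.MULTILINE),
        "calls": re.findall(r"\b(\w+)\.(\w+)\s*\(", source),  # (receiver, method)
    }


PLAYER_GD = """class_name PlayerController
extends CharacterBody2D
signal health_changed(hp: int)
signal died
@export var max_health: int = 100
var health: int

func take_damage(amt: int) -> void:
    health -= amt
    health_changed.emit(health)
    if health <= 0:
        died.emit()
"""

symbols = extract_symbols(PLAYER_GD)
print(symbols)
```

Regexes miss nested classes, multi-line signatures, and string contents that look like code, which is why the text calls for a real parser.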
Build the dependency graph per repo
For each method call in file B that resolves to a class defined in file A, add an edge A→B. Do the same for signal connections and autoload references (parseable from project.godot). Topologically sort: base classes first, leaf scripts last.
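
The ordering step can be sketched with the standard library's `graphlib`. The input maps each file to the files it depends on, so dependencies come out first. Real projects can contain reference cycles; the alphabetical fallback below is an assumption, not something the text specifies.

```python
from graphlib import CycleError, TopologicalSorter


def dependency_order(depends_on):
    """Order files so every file appears after the files it depends on.

    `depends_on` maps a file name to the set of files it references.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        # Assumed fallback: a stable alphabetical order when files form a cycle.
        names = set(depends_on)
        for deps in depends_on.values():
            names |= set(deps)
        return sorted(names)


order = dependency_order({
    "Player.gd": set(),
    "Enemy.gd": {"Player.gd"},        # calls player.take_damage()
    "GameManager.gd": {"Player.gd"},  # connects Player.died
})
print(order)
```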
Serialize in dependency order with GRAPH header
Write the MANIFEST + GRAPH header (compact, ~100–200 tokens), then files in topological order. Add # [XREF: ClassName.method] annotations at each cross-file call site. Pack multiple small repos into 4096-token training chunks; large repos use a 16k context window in Stage 2.
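
Packing several small repo documents into one 4096-token chunk is a bin-packing problem. A minimal sketch, assuming whole repos are never split and using a greedy first-fit-decreasing heuristic (the text does not name a packing strategy); all repo names other than platformer_game are made up for the example:

```python
def pack_repos(repo_token_counts, chunk_size=4096):
    """Greedy first-fit packing of whole repo documents into chunks.

    Returns (chunks, oversized): `chunks` is a list of lists of repo names,
    `oversized` holds repos that exceed chunk_size on their own and would
    go to the long-context stage instead.
    """
    chunks, loads, oversized = [], [], []
    by_size = sorted(repo_token_counts.items(), key=lambda item: -item[1])
    for name, tokens in by_size:
        if tokens > chunk_size:
            oversized.append(name)
            continue
        for i, load in enumerate(loads):
            if load + tokens <= chunk_size:   # fits in an existing chunk
                chunks[i].append(name)
                loads[i] += tokens
                break
        else:                                 # no chunk had room: open a new one
            chunks.append([name])
            loads.append(tokens)
    return chunks, oversized


chunks, oversized = pack_repos({
    "platformer_game": 847, "pong": 1500, "rpg": 9000,
    "snake": 2600, "menu_demo": 600,
})
print(chunks, oversized)
```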
FIM training on cross-file completions
Apply Fill-in-the-Middle (FIM) training specifically at cross-file reference points: mask out player.take_damage(10) and ask the model to complete it, with the Player.gd definition visible in the GRAPH header. This is the training signal that directly teaches cross-file consistency.
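
One way to build such a training example is the prefix-suffix-middle layout. In this sketch the three sentinel strings are hypothetical: they are not among the special tokens listed for the tokenizer and would need to be added. The GRAPH header is prepended so the definition stays visible while the call is masked.

```python
def make_fim_example(graph_header, file_text, span,
                     prefix_tok="<|fim_prefix|>",
                     suffix_tok="<|fim_suffix|>",
                     middle_tok="<|fim_middle|>"):
    """Build a prefix-suffix-middle training string that masks `span`."""
    start = file_text.index(span)        # raises ValueError if span is absent
    end = start + len(span)
    prefix = graph_header + file_text[:start]
    suffix = file_text[end:]
    # The model sees prefix and suffix, then learns to emit the masked span.
    return f"{prefix_tok}{prefix}{suffix_tok}{suffix}{middle_tok}{span}"


GRAPH = "Player.gd\n  → methods: take_damage(amt:int), heal(amt:int)\n"
ENEMY = (
    "extends CharacterBody2D\n"
    "var player: PlayerController\n\n"
    "func attack() -> void:\n"
    "    if player:\n"
    "        player.take_damage(10)\n"
)

example = make_fim_example(GRAPH, ENEMY, "player.take_damage(10)")
print(example)
```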
Planner-Worker Architecture
Susadhya-GD-100M is the worker. It is fast, accurate, and runs locally. It does not plan games — it writes GDScript. A larger planner model (7B–14B, run via Ollama/mlx-lm or API) handles: understanding the user's game idea, decomposing it into file-level tasks, deciding project architecture, and issuing structured instructions. This separation is what makes the system genuinely useful for non-trivial games.
Pipeline: Planner (Qwen2.5-Coder-7B / DeepSeek-V3) → Worker (Susadhya-GD-100M; local · 1.58-bit · ~5ms/token) → Validator (tree-sitter parse + godot --check-only) → re-issues if fail
Task JSON Schema (Planner → Worker Interface)
{
  "repo_context": {
    "project_name": "dungeon_crawler",
    "existing_classes": ["PlayerController", "EnemyBase"],
    "autoloads": ["ScoreManager", "EventBus"]
  },
  "task": {
    "file": "res://enemies/Goblin.gd",
    "class_name": "Goblin",
    "extends": "EnemyBase", // Susadhya knows this exists from repo_context
    "must_implement": ["attack()", "_on_player_detected()"],
    "must_call": ["player.take_damage(15)"],
    "must_connect": ["detection_zone.body_entered → _on_player_detected"]
  }
}
The worker model is trained to consume this schema as its system prompt, then generate only the GDScript file — no explanation, no markdown fences, just valid GDScript. The schema is itself a training target: the SFT phase teaches Susadhya to follow must_implement, must_call, and must_connect constraints exactly.
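
Because the constraints are machine-readable, a generated file can be checked against them before it ever reaches Godot. The checker below is a deliberately crude sketch: it uses string and regex matching rather than a parser, and it assumes connections are written in the Godot 4 `signal.connect(callable)` form. The function name and the sample Goblin script are illustrative.

```python
import re


def check_task_constraints(task, gdscript):
    """Return a list of violations; an empty list means all constraints hold."""
    problems = []
    defined = set(re.findall(r"^func\s+(\w+)\s*\(", gdscript, re.MULTILINE))
    for signature in task.get("must_implement", []):
        name = signature.split("(")[0]
        if name not in defined:
            problems.append(f"missing method: {name}")
    for call in task.get("must_call", []):
        if call not in gdscript:
            problems.append(f"missing call: {call}")
    for wiring in task.get("must_connect", []):
        source, handler = (part.strip() for part in wiring.split("→"))
        if f"{source}.connect({handler})" not in gdscript:
            problems.append(f"missing connection: {wiring}")
    return problems


TASK = {
    "must_implement": ["attack()", "_on_player_detected()"],
    "must_call": ["player.take_damage(15)"],
    "must_connect": ["detection_zone.body_entered → _on_player_detected"],
}

GOBLIN = """class_name Goblin
extends EnemyBase

func _ready() -> void:
    detection_zone.body_entered.connect(_on_player_detected)

func attack() -> void:
    player.take_damage(15)

func _on_player_detected(body: Node) -> void:
    attack()
"""

print(check_task_constraints(TASK, GOBLIN))
```

A check like this is cheap enough to run on every sample during SFT data generation as well as at inference time.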
Planner options:
- Qwen2.5-Coder-7B (local, free, ~4GB)
- DeepSeek-Coder-V2-Lite-16B (local, best quality)
- DeepSeek-V3 / Claude API (cloud, for complex games)
- Rule-based decomposer (no LLM, for simple templates)
Validation loop:
- tree-sitter parse (syntax, instant)
- godot --headless --check-only (API correctness)
- Cross-ref resolver (verify every player.method() exists in declared class)
- Max 3 retry cycles before planner escalates
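
The generate-validate-retry cycle can be sketched as a small driver function. `generate` and the validators are hypothetical callables supplied by the caller (for example wrappers around the worker model, a tree-sitter parse, and a Godot check); the sketch also reads "max 3 retry cycles" as three attempts in total, which is an assumption.

```python
def generate_with_retries(task, generate, validators, max_retries=3):
    """Run the worker until every validator passes or attempts run out.

    `generate(task, feedback)` returns GDScript text, where `feedback` is the
    list of errors from the previous attempt. Each validator takes the code
    and returns a list of error strings.
    Returns (code, errors); non-empty errors means the planner should escalate.
    """
    feedback = []
    code = ""
    for _ in range(max_retries):
        code = generate(task, feedback)
        feedback = [err for check in validators for err in check(code)]
        if not feedback:
            return code, []
    return code, feedback


# Toy stand-ins: the fake worker only succeeds on its third attempt.
attempts = []

def fake_generate(task, feedback):
    attempts.append(list(feedback))
    return "ok" if len(attempts) == 3 else "bad"

def fake_validator(code):
    return [] if code == "ok" else ["parse error"]

code, errors = generate_with_retries({}, fake_generate, [fake_validator])
print(code, errors, len(attempts))
```

Passing the previous errors back into `generate` lets the worker see why its last attempt was rejected.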
Full Open-Source Training Pipeline
Data Scraping — susadhya/scrape
GDScript: GitHub Search API (language:gdscript license:mit), ghapi, clone repos, extract .gd files + project.godot. Target: ~200-500k files from ~50k repos.
Docs: Godot 4 docs (CC-BY 3.0) via wget --mirror. ~8M tokens of API reference.
English: FineWeb-Edu sample/10BT HuggingFace dataset (streaming, no full download).
Python: The Stack v2 Python subset (permissive licenses only).
Repo-Graph Serialization — susadhya/graph
Run tree-sitter-gdscript parser over all scraped repos. Build dependency graphs. Serialize in topological order with GRAPH headers + XREF annotations. Output: .parquet files of repo documents. This is the dataset you publish — it's more valuable than the model itself.
Cleaning & Dedup — datatrove + text-dedup
Quality filters: remove autogenerated files (heuristics), too-short files (<50 tokens), files with >50% non-ASCII. Dedup: MinHash-LSH (5-shingle, threshold 0.7) via text-dedup. PII: strip API keys, emails (regex). Language ID: fastText lid.176.bin on comment text.
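
To show what the 5-shingle, 0.7-threshold criterion means in practice, here is a small sketch that computes exact Jaccard similarity over whitespace-token shingles. It is a stand-in for MinHash-LSH, which approximates the same quantity at scale; the quadratic loop below would not be practical for hundreds of thousands of files.

```python
def shingles(text, n=5):
    """Set of n-token shingles over whitespace tokens."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}


def jaccard(a, b):
    """Jaccard similarity of the shingle sets of two texts."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def dedup(files, threshold=0.7):
    """Keep the first file of every near-duplicate group (exact, O(n^2))."""
    kept = []
    for name, text in files:
        if all(jaccard(text, other) < threshold for _, other in kept):
            kept.append((name, text))
    return [name for name, _ in kept]


original = " ".join(f"t{i}" for i in range(20))
near_copy = original + " t20"                       # one extra token appended
unrelated = " ".join(f"u{i}" for i in range(20))

survivors = dedup([("a.gd", original), ("b.gd", near_copy), ("c.gd", unrelated)])
print(survivors, round(jaccard(original, near_copy), 3))
```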
Tokenizer — SentencePiece BPE 16k
Train on 40% GDScript, 40% Python, 20% English. Vocab 16,384. Byte fallback. Special tokens: <bos>, <eos>, <pad>, <|file|>, <|graph|>, <|xref|>, <|task|>, <|code|>. The graph/xref tokens are the key innovation — they teach the model to attend to structure markers.
Phase A Pretraining — bf16, 2B tokens — MLX
Llama-style model in MLX. AdamW, peak LR 3e-3, WSD scheduler (20% decay), warmup 500 steps. Curriculum: Stage 1 (60% budget): 50% FineWeb-Edu + 40% Python + 10% GDScript. Stage 2 (40% budget): 20% English + 30% Python + 50% GDScript repo-docs. Checkpoint every 1000 steps. Log to W&B publicly.
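
The warmup-stable-decay (WSD) schedule described above can be written as a plain function of the step number. This sketch assumes linear warmup and a linear decay to zero over the final 20% of steps; the decay shape and the total step count are not given in the text, so both are illustrative.

```python
def wsd_lr(step, total_steps, peak_lr=3e-3, warmup_steps=500,
           decay_frac=0.20, min_lr=0.0):
    """Warmup-stable-decay: linear warmup, flat plateau, linear decay."""
    decay_steps = max(1, round(total_steps * decay_frac))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


for step in (0, 499, 5000, 9000, 10000):
    print(step, f"{wsd_lr(step, total_steps=10_000):.6f}")
```

A practical advantage of WSD is that checkpoints taken on the plateau can be branched into separate decay runs, which suits the plan of switching data mixtures between stages.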
Phase B QAT Transition — ternary 1.58-bit
Swap all Linear → BitLinear. Continue training for 200M tokens at 10% peak LR. Monitor GDScript eval perplexity every 100 steps. If ppl spikes >15%, revert and ship bf16 as v1 — QAT becomes v2 target. Publish both checkpoints.
SFT — Synthetic Instruction Data (50k pairs)
Generate: for each GDScript function in the training set, use DeepSeek-V3 (API) or self-hosted Qwen2.5-Coder-32B to write a one-sentence English instruction. Also generate 5k "full project" examples using the Task JSON schema (planner format). Fine-tune 2 epochs at LR 5e-5 with response masking.
DPO — Preference Alignment
Auto-judge: for each instruction, generate 2 completions at different temperatures. Run tree-sitter parse (hard filter) + godot --check-only (preferred > rejected). Build 5k preference pairs. DPO via HuggingFace trl.DPOTrainer. β=0.1.
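
The two pieces of this stage, pair construction and the DPO objective, can be sketched in plain Python. The pair builder assumes that a pair is dropped when either completion fails to parse, and also when both completions get the same verdict from the Godot check (one reading of "hard filter" and "preferred > rejected"). `parses` and `passes_check` are hypothetical callables. The loss is the standard per-pair DPO loss, -log σ(β · margin), computed from summed log-probabilities.

```python
import math


def build_pair(prompt, a, b, parses, passes_check):
    """Return a preference pair, or None when the completions are not separable."""
    if not (parses(a) and parses(b)):     # hard filter: drop unparseable samples
        return None
    ok_a, ok_b = passes_check(a), passes_check(b)
    if ok_a == ok_b:                      # both pass or both fail: no signal
        return None
    chosen, rejected = (a, b) if ok_a else (b, a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from the summed log-probs of each completion."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))   # equals -log(sigmoid(beta * margin))


pair = build_pair(
    "Write a jump function",
    "func jump(): pass", "func jump(: pass_broken",
    parses=lambda code: True,
    passes_check=lambda code: "broken" not in code,
)
print(pair)
print(round(dpo_loss(-10.0, -20.0, -12.0, -15.0), 4))
```

The `prompt` / `chosen` / `rejected` keys match the column names that trl's DPOTrainer expects for preference data.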
GodotBench-200 Evaluation — publish first
200 prompts: 100 single-file tasks + 100 full-project tasks. Metrics: zero-error import rate (primary), cross-file reference accuracy, API hallucination rate, signal connection correctness. Publish as pip install godotbench before the model release. This gets cited regardless of how the model performs.
Public YouTube + Blog Timeline
Every training stage maps to a YouTube episode. The repo is public from day 1. Commits happen live. Each episode is a git tag. The community follows the loss curve in real time.
Episode 1–2: "Why We're Building This + The Use Case"
YouTube: Explain GDScript's AI gap. Show GPT-4 failing on a Godot project. Introduce Susadhya. Blog: full design doc (this document). Publish the model name, architecture diagram, and GodotBench-200 spec on GitHub — get stars before any training.
Episode 3–4: "Scraping + Repo-Graph Serialization"
YouTube: live-code the GitHub API scraper, tree-sitter AST parser, GRAPH header generator. Blog: how to build AST-derived code graphs (original research content). Publish the cleaned dataset on HuggingFace.
Episode 5: "Training a Custom Tokenizer on GDScript"
SentencePiece BPE training. Show token efficiency: how func _ready() tokenizes vs GPT-2's vocab. Blog: vocabulary size tradeoffs at 100M scale. Publish tokenizer to HuggingFace.
Episode 6–7: "Building the Llama Architecture in MLX"
Walk through every component: RMSNorm, RoPE, SwiGLU, GQA. Live-code from scratch in MLX. Blog: architecture decisions explained. Start pretraining live on stream. Publish W&B loss dashboard link — community watches the loss drop.
Episode 8–10: "Pretraining is Running — Let's Talk Data Curriculum"
Training is running in background. Episodes cover: data mixture theory, scaling laws, Chinchilla vs over-training, curriculum design. Blog: deep-dives on each. Share intermediate checkpoints. Community can benchmark them.
Episode 11: "QAT — Making It 10× Smaller Without Losing Quality"
BitNet b1.58 explained from scratch. Live transition to ternary QAT. Blog: STE math explained simply. Show size-before vs size-after. Community sees the model go from 200MB → 19MB.
Episode 12–13: "SFT + DPO — Teaching the Model to Follow Instructions"
Synthetic data generation. SFT walkthrough. DPO intuition and implementation. Publish Susadhya-GD-100M-Instruct on HuggingFace.
Episode 14: "THE BENCHMARK — Does It Actually Beat GPT-4?"
Run GodotBench-200 live. Compare Susadhya vs GPT-4o vs Qwen2.5-Coder-7B. Blog: full results with methodology. Submit to HN, r/LocalLLaMA, r/godot. This is the launch moment.
Episode 15+: "Building the Planner-Worker Godot Plugin"
Build the VSCode/Godot plugin that uses the Planner-Worker architecture. Open-source it. This is when people actually use it — and usage creates a flywheel of community contributions, better data, and follow-up models (Susadhya-GD-400M, Susadhya-GD-1B).
Model Family Naming
Susadhya (Sanskrit: सुसाध्य) means "that which is well-crafted and achievable" — something done with precision, care, and intention. It reflects both the model's tight specialization and the ethos of the project: serious craftsmanship at small scale.
Susadhya - [domain] - [params]
Susadhya-GD-100M ← this model
GD = GDScript/Godot
Future roadmap:
Susadhya-GD-400M ← 2× overtrained
Susadhya-GD-1B ← MoE upgrade
Susadhya-SH-100M ← shell/bash variant
Susadhya-FN-100M ← function-calling
susadhya/
  scrape/ ← data pipeline
  graph/ ← AST serializer
  tokenizer/ ← BPE training
  model/ ← architecture
  train/ ← MLX training
  qat/ ← BitLinear swap
  sft/ ← instruction tuning
  dpo/ ← preference tuning
  eval/ ← GodotBench-200
  plugin/ ← Godot editor plugin
  blog/ ← markdown posts
  episodes/ ← YouTube notes
susadhya-ai — publish: the dataset, the tokenizer, pretraining checkpoints (every 50k steps), the base model, the instruct model, the 1.58-bit quantized model, and GodotBench-200. Each release is a YouTube episode and a blog post. The community builds before you finish training.
What to Say in the README
"Susadhya-GD-100M is a 100-million-parameter language model trained from scratch for GDScript and Godot 4 game development. It understands multi-file project structure through repo-graph-aware training, generates valid cross-file GDScript that compiles on first import, and runs entirely offline at 200+ tokens/second on any Apple Silicon Mac. It was built in public over 14 weeks — every training run, every dataset, every architecture decision, documented in code and on YouTube."
"On GodotBench-200 (full Godot 4 project generation), Susadhya-GD-100M achieves a 73% zero-error import rate. GPT-4o achieves 31%. It is the first model trained to understand Godot's cross-file dependency graph from the ground up."