Susadhya-GD-100M — Design Blueprint

§1: The Use Case & Why It Works
§2: Model Architecture
§3: QAT Strategy (bf16 → 1.58-bit)
§4: Repo-Graph Training Format
§5: Planner-Worker System
§6: Full Training Pipeline
§7: Public Teaching Timeline
§8: Naming & Branding

§1

The Use Case — "It just works, first try"

The killer use case isn't code autocomplete. It's this: a solo indie developer opens a blank Godot 4 project, types a game idea in plain English, and Susadhya-GD-100M generates a full, importable, zero-error Godot project — complete multi-file scene trees, proper autoloads, signal wiring, cross-script references, resource paths, and a project.godot — that compiles on first import.

The target benchmark headline: "Susadhya-GD-100M achieves 73% zero-error import rate on GodotBench-200 (full project generation). GPT-4o achieves 31%. Qwen2.5-Coder-7B achieves 28%." Large models fail because they hallucinate GDScript APIs, invent node class names, and break cross-file references. Susadhya is designed to avoid these failure modes because it is trained on real, graph-annotated Godot 4 repos from the start.

Why big models fail at this

GPT-4 sees GDScript as "Python-ish." It hallucinates @export_range arguments, uses node classes that do not exist in Godot 4 (for example KinematicBody2D, which Godot 4 replaced with CharacterBody2D), wires signals to methods that don't exist in the target script, and can't keep res:// path references consistent across the files it generated. A zero-error Godot import requires ALL of these to be correct simultaneously.

Why a 100M model CAN win here

The GDScript grammar is small and regular. The valid API surface of Godot 4 is finite (~800 classes, ~5k methods). A model trained entirely on real Godot 4 repos with AST-graph annotations will have near-perfect grammar adherence and cross-file consistency — the specific failure modes of large models — even at 100M params.

Target User Scenarios

"Generate a Flappy Bird clone for me"

Planner model decomposes → Susadhya writes: Main.tscn, Bird.gd, Pipe.gd, GameManager.gd, project.godot, autoload/Score.gd. Each file references the others correctly. The user imports the project from the Godot Project Manager and it just runs.

"Add a health system and death screen to my existing project"

Susadhya reads the existing project graph (via AST tool), identifies the Player node and existing signals, generates a HealthComponent.gd that connects to the existing Player.gd via the correct signal names — no hallucinated method calls.

VSCode / Godot plugin: inline repo-aware completion

As you type in Enemy.gd, Susadhya knows that Player.gd in your project has a take_damage(amount: int) method and health_changed signal — auto-completes cross-file references correctly, offline, in <40ms on the user's M-series Mac.

73%: target zero-error import rate

<40ms: inference latency (1.58-bit on M4)

~19MB: quantized (1.58-bit) model size on disk (200MB in bf16)

307k: r/godot members (hungry audience)

§2

Model Architecture — MoE-MLA Hybrid

Standard Llama-style is the safe choice. But given the latest research (Mehta et al., 2025: "MoE-MLA-RoPE achieves 68% KV cache reduction and 3.2× inference speedup while using 42% fewer active parameters per forward pass"), Susadhya adopts a lightweight MoE + MLA architecture that is more efficient at inference while matching or beating a dense baseline in perplexity.

Architecture philosophy: Phase 1 ships as a clean dense Llama-style model (safe, well-understood, easier to teach). Phase 2 converts to MoE-MLA post-pretraining via continued training. This staged approach means the YouTube series teaches dense transformers first (foundational), then MoE and MLA as advanced episodes — maximum educational value.

Phase 1 — Dense Baseline (Weeks 1–10)

Hyperparameter	Value	Rationale
Layers	12	Typical depth for a model near 100M params
Hidden size (d_model)	768	Fits in M4 Pro 24GB with seq=4096
Attention heads (Q)	12	Head dimension 768 / 12 = 64
KV heads (GQA)	4	3:1 Q:KV ratio, ~67% KV cache reduction vs MHA
FFN hidden	2048	SwiGLU (gate, up, and down projections); 3 × 768 × 2048 ≈ 4.7M params per layer
Context length	4096	Fits full small Godot project; extend to 16k in repo-stage
Vocabulary	16,384	Custom BPE on GDScript+Python+English
Positional encoding	RoPE (θ=500k)	Llama 3 base for better length extension
Normalization	RMSNorm, pre-norm	Standard; faster than LayerNorm
Tied embeddings	Yes	Saves ~12.6M params (16,384 × 768)
Total params	~97M	Round to "100M" for branding
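As a sanity check on the table, the parameter count can be tallied directly from the listed hyperparameters. This is a minimal sketch that assumes a bias-free Llama-style block with two RMSNorm gains per layer and one final norm; the function name `dense_param_count` is illustrative, not part of the project.

```python
def dense_param_count(layers=12, d_model=768, q_heads=12, kv_heads=4,
                      ffn_hidden=2048, vocab=16_384, tied=True):
    """Count weights for a bias-free Llama-style decoder (assumed layout)."""
    head_dim = d_model // q_heads
    kv_dim = kv_heads * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # Q and O, plus K and V
    ffn = 3 * d_model * ffn_hidden                       # SwiGLU: gate, up, down
    norms = 2 * d_model                                  # two RMSNorm gains per layer
    embed = vocab * d_model
    total = layers * (attn + ffn + norms) + d_model + embed  # + final norm gain
    if not tied:
        total += embed                                   # separate output head
    return total


print(f"tied:   {dense_param_count() / 1e6:.1f}M")
print(f"untied: {dense_param_count(tied=False) / 1e6:.1f}M")
```

Under these assumptions the listed values come to roughly 88M parameters with tied embeddings and roughly 101M untied, so the ~97M total is best read as approximate, or as assuming a setting the table does not list.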

Phase 2 — MoE-MLA Upgrade (optional, post-pretraining)

Replace dense FFN layers with sparse MoE (8 experts, top-2 routing) and replace GQA attention with MLA (Multi-Head Latent Attention, DeepSeek-V2 style). This converts the model from 97M dense params to ~200M total params / ~60M active params per token — achieving higher effective capacity at lower inference cost.

MLA — Multi-Head Latent Attention

Compresses KV tensors into a low-dimensional latent space before caching. At inference time, latents are projected back. Result: 68% KV cache reduction vs GQA, better modeling performance. KV cache for a 4096-token context: MHA=288MB → GQA=96MB → MLA=31MB (these figures correspond to 4 bytes per cached element; bf16 caching halves them). Critical for local deployment.
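The cache sizes quoted above follow from the Phase 1 shape (12 layers, head dimension 64, 4096 tokens). The sketch below reproduces them; `kv_cache_mib` is a hypothetical helper, and the MLA number is simply the quoted 68% reduction applied to the GQA size, since no latent dimension is specified here.

```python
def kv_cache_mib(seq_len=4096, layers=12, kv_heads=12, head_dim=64, bytes_per_elem=4):
    """Size of the cached K and V tensors for one sequence, in MiB."""
    elems = 2 * layers * seq_len * kv_heads * head_dim  # 2 = one K and one V tensor
    return elems * bytes_per_elem / 2**20


mha = kv_cache_mib(kv_heads=12)   # every query head has its own K/V head
gqa = kv_cache_mib(kv_heads=4)    # 3 query heads share each K/V head
mla = gqa * (1 - 0.68)            # quoted reduction, not derived from a latent size

print(f"MHA {mha:.0f} MiB, GQA {gqa:.0f} MiB, MLA ~{mla:.0f} MiB")
print(f"GQA in bf16: {kv_cache_mib(kv_heads=4, bytes_per_elem=2):.0f} MiB")
```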

Sparse MoE FFN

8 micro-experts (64 in the full DeepSeek recipe, but 8 is viable at 100M scale), top-2 routing. 42% fewer FLOPs per forward pass vs dense equivalent. Shared experts (2 always-on) handle common GDScript patterns; routed experts specialize (e.g. signal patterns, physics, UI nodes).
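To make the routing concrete, here is a toy sketch of top-2 gating combined with always-on shared experts. It uses scalars instead of tensors and plain Python instead of MLX; `top2_route` and `moe_output` are illustrative names, and renormalizing the two selected gate probabilities is one common convention, assumed here rather than taken from the design.

```python
import math


def top2_route(gate_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    m = max(gate_logits)
    exps = [math.exp(g - m) for g in gate_logits]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)              # renormalize over the chosen pair
    return [(i, probs[i] / norm) for i in top2]


def moe_output(x, gate_logits, routed_experts, shared_experts):
    """Sum the shared experts, then add the weighted top-2 routed experts."""
    out = sum(f(x) for f in shared_experts)
    for i, w in top2_route(gate_logits):
        out += w * routed_experts[i](x)
    return out
```

Only the two selected routed experts run for a given token, which is where the reduction in active parameters comes from.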

§3

Quantization-Aware Training — bf16 → 1.58-bit

The BitNet b1.58 strategy (Ma et al., 2024; Nielsen & Schneider-Kamp, 2024) trains weights constrained to {−1, 0, +1} — ternary values. The latest research confirms a "16-to-1.58" staged strategy: pretrain in standard bf16 first, then transition to ternary QAT in a second phase. This achieves near full-precision quality with only a 2–3 point aggregate drop, while enabling massive inference-time savings.

Why this matters for Susadhya: A 100M bf16 model is 200MB and runs at ~40ms/token on M4 Pro. A 100M 1.58-bit model is ~19MB and runs at ~5ms/token — fast enough for real-time editor autocomplete. You can ship a Godot plugin that loads the entire model into RAM in under a second.

The Training Strategy

Phase A — Full bf16 Pretraining (Days 1–10)

Standard bf16 pretraining on 2B tokens of GDScript + Python + English using AdamW. This trains the "shadow weights" — high-precision floats that capture the model's real learned values. Normal loss convergence, normal training — everything the YouTube series teaches as "step 1."

Phase B — Ternary QAT Transition (Days 10–14)

Replace all linear layers with BitLinear modules. Each forward pass: shadow weights are quantized to {−1,0,+1} by dividing by the mean absolute weight, rounding, and clipping to [−1, 1], i.e. clip(round(W / mean(|W|)), −1, 1). This is the absmean scheme of BitNet b1.58; BitNet b1.58 Reloaded also studies a median-based scale. Activations are quantized to int8 via absmax scaling. Gradients flow back via the Straight-Through Estimator (STE), i.e. the gradient of the quantizer is treated as 1. Only shadow weights update; inference uses only the 1.58-bit weights. LR is decayed to 10% of peak for this phase.

Phase C — SFT + DPO in 1.58-bit (Days 14–17)

Fine-tune the already-ternary model on the synthetic (instruction, GDScript project) pairs. Because SFT is lower-LR and shorter, the QAT shadow weight trick still works. DPO runs after SFT. The shipped model is 1.58-bit throughout — no separate quantization step needed.

Key Implementation Details

import torch
import torch.nn as nn
import torch.nn.functional as F

# BitLinear layer: drop-in replacement for nn.Linear (no bias)
class BitLinear(nn.Module):
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f))  # shadow weights (bf16)

    def forward(self, x):
        w = self.weight
        # Absmean ternary quantization: scale, round, clip to {-1, 0, +1}, rescale
        w_scale = w.abs().mean().clamp(min=1e-8)
        w_quant = (w / w_scale).round().clamp(-1, 1) * w_scale
        # STE: forward uses w_quant, backward treats the quantizer as identity
        w_ste = w + (w_quant - w).detach()

        # Fake-quantize activations to int8 (absmax per token), also through STE
        x_scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127
        x_quant = (x / x_scale).round().clamp(-128, 127) * x_scale
        x_ste = x + (x_quant - x).detach()

        return F.linear(x_ste, w_ste)

# In MLX: wrap the swap with mlx.core.compile() for speed
# Transition: call swap_linear_to_bitlinear(model) after Phase A

Memory at inference (100M model)

Format	Size	Tok/s (M4)
bf16	200 MB	~25 tok/s
4-bit GGUF	50 MB	~80 tok/s
1.58-bit	~19 MB	~200+ tok/s
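The sizes in this table are straightforward bits-per-weight arithmetic, and the throughput column maps onto the per-token latencies quoted earlier. A small sketch, assuming every weight is stored at the stated width (any tensors kept at higher precision, plus file-format overhead, would add to these numbers):

```python
def model_size_mb(params=100e6, bits_per_weight=16):
    """Weights-only size in megabytes for a given storage width."""
    return params * bits_per_weight / 8 / 1e6


def ms_per_token(tok_per_s):
    """Convert decoding throughput to per-token latency."""
    return 1000 / tok_per_s


for name, bits in [("bf16", 16), ("4-bit", 4), ("1.58-bit", 1.58)]:
    print(f"{name:>8}: {model_size_mb(bits_per_weight=bits):.2f} MB")
print(f"25 tok/s = {ms_per_token(25):.0f} ms/token, 200 tok/s = {ms_per_token(200):.0f} ms/token")
```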

Caution — performance at 100M scale

BitNet b1.58 Reloaded (Nielsen & Schneider-Kamp, 2024) found that at <100M params, doubling hidden size compensates for ternary precision loss. This is why the Phase 2 MoE upgrade (same compute, more effective capacity) pairs so well with QAT. If QAT raises perplexity on the GDScript eval by more than 5 points, revert to 4-bit GGUF via llama.cpp, which is still usable and far simpler.

§4

Repo-Graph Training Format — The Core Innovation

The reason frontier models fail at multi-file Godot projects is missing cross-file context during training: files are mostly seen in isolation. Susadhya addresses this by serializing entire Godot repos as dependency-ordered, AST-annotated documents, so the model learns that Enemy.gd depends on Player.gd and that calling player.take_damage() requires the method to exist in the adjacent file.

The Serialization Format — "Repo Document"

### REPO: platformer_game | 12 files | 847 tokens ###
### MANIFEST ###
project.godot · res://scenes/Main.tscn · res://scripts/Player.gd
res://scripts/Enemy.gd · res://scripts/GameManager.gd · autoloads=[ScoreManager]

### GRAPH ###
Player.gd      → defines: PlayerController [CharacterBody2D]
               → signals: health_changed(hp:int), died()
               → methods: take_damage(amt:int), heal(amt:int), _physics_process()
Enemy.gd       → defines: EnemyBase [CharacterBody2D]
               → calls: player.take_damage(amt)   # cross-file ref
               → imports: res://scripts/Player.gd
GameManager.gd → connects: Player.died → _on_player_died()
               → autoloads: ScoreManager

### FILE: res://scripts/Player.gd ###
class_name PlayerController
extends CharacterBody2D

signal health_changed(hp: int)
signal died

@export var max_health: int = 100
var health: int

func take_damage(amt: int) -> void:
    health -= amt
    health_changed.emit(health)
    if health <= 0: died.emit()

### FILE: res://scripts/Enemy.gd ###
extends CharacterBody2D
# [XREF: PlayerController.take_damage]

var player: PlayerController

func attack() -> void:
    if player: player.take_damage(10)
### END REPO ###

Why this works: By placing the GRAPH header before all file content, the model learns to attend to cross-file definitions when generating later files. The # [XREF: ...] annotations act as soft attention anchors. This is the same insight as CoCoMIC (Ding et al., 2024) and RepoGraph (ICLR 2025): explicitly encoding dependency structure into the training input causes models to learn cross-file attention patterns rather than treating each file independently.

How to Build It — The AST Pipeline

Parse each .gd file with `tree-sitter-gdscript`

Extract: class_name, extends, signal definitions, method signatures, @export vars, and all method call expressions. Build a per-file symbol table. This is deterministic and cheap — runs in seconds over thousands of repos.
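The shape of the per-file symbol table can be illustrated without a parser. The sketch below uses regular expressions as a stand-in for tree-sitter-gdscript so that it runs with the standard library alone; it is an approximation (it ignores inner classes, multi-line declarations, and strings), and `extract_symbols` is a hypothetical name.

```python
import re


def extract_symbols(source: str) -> dict:
    """Collect top-level declarations from GDScript text (regex approximation)."""
    def first(pattern):
        m = re.search(pattern, source, re.MULTILINE)
        return m.group(1) if m else None

    return {
        "class_name": first(r"^class_name\s+(\w+)"),
        "extends": first(r"^extends\s+(\w+)"),
        "signals": re.findall(r"^signal\s+(\w+)", source, re.MULTILINE),
        "methods": re.findall(r"^func\s+(\w+)\s*\(", source, re.MULTILINE),
        "exports": re.findall(r"^@export\w*(?:\([^)]*\))?\s+var\s+(\w+)",
                              source, re.MULTILINE),
        # (receiver, method) pairs for every `receiver.method(` expression
        "calls": re.findall(r"(\w+)\.(\w+)\s*\(", source),
    }


player_src = """class_name PlayerController
extends CharacterBody2D

signal health_changed(hp: int)
signal died

@export var max_health: int = 100

func take_damage(amt: int) -> void:
    health_changed.emit(amt)
"""
print(extract_symbols(player_src))
```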

Build the dependency graph per repo

For each method call in file B that resolves to a class defined in file A, add an edge A→B. Do the same for signal connections and autoload references (parseable from project.godot). Topologically sort: base classes first, leaf scripts last.
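The ordering step can be sketched with the standard library's `graphlib`. The input maps each file to the set of files it depends on, which is the same information as the A→B edges described above. The alphabetical fallback for cyclic references is an assumption of this sketch, since the text does not say how cycles should be handled.

```python
from graphlib import CycleError, TopologicalSorter


def dependency_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Order files so that each one appears after the files it depends on."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        # Assumption: scripts that reference each other in a cycle fall back
        # to a stable alphabetical order instead of failing the whole repo.
        nodes = set(depends_on)
        for deps in depends_on.values():
            nodes |= deps
        return sorted(nodes)


order = dependency_order({
    "Player.gd": set(),
    "Enemy.gd": {"Player.gd"},
    "GameManager.gd": {"Player.gd"},
})
print(order)
```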

Serialize in dependency order with GRAPH header

Write the MANIFEST + GRAPH header (compact, ~100–200 tokens), then files in topological order. Add # [XREF: ClassName.method] annotations at each cross-file call site. Pack multiple small repos into 4096-token training chunks; large repos use a 16k context window in Stage 2.
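A minimal sketch of the serializer, assuming the GRAPH lines have already been rendered and the file order already computed. The token count in the REPO header is left out because it depends on the tokenizer, and XREF annotation is not shown; `serialize_repo` is an illustrative name.

```python
def serialize_repo(name, files, graph_lines, order):
    """Render one repo document.

    files: {res:// path: source text}
    graph_lines: pre-rendered GRAPH entries, one string per line
    order: paths in dependency order (base classes first)
    """
    parts = [
        f"### REPO: {name} | {len(files)} files ###",
        "### MANIFEST ###",
        " · ".join(order),
        "",
        "### GRAPH ###",
        *graph_lines,
        "",
    ]
    for path in order:
        parts += [f"### FILE: {path} ###", files[path].rstrip("\n"), ""]
    parts[-1] = "### END REPO ###"  # replace the trailing blank line
    return "\n".join(parts)


files = {
    "res://scripts/Player.gd": "class_name PlayerController\nextends CharacterBody2D\n",
    "res://scripts/Enemy.gd": "extends CharacterBody2D\n",
}
doc = serialize_repo(
    "platformer_game",
    files,
    ["Player.gd → defines: PlayerController [CharacterBody2D]"],
    ["res://scripts/Player.gd", "res://scripts/Enemy.gd"],
)
print(doc)
```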

FIM training on cross-file completions

Apply Fill-in-the-Middle (FIM) training specifically at cross-file reference points: mask out player.take_damage(10) and ask the model to complete it, with the Player.gd definition visible in the GRAPH header. This is the training signal that directly teaches cross-file consistency.
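One way to build such an example is to cut the cross-file call out of the document and move it behind a sentinel. The sketch below uses a prefix-suffix-middle layout; the sentinel strings are placeholders, since the actual special tokens are not defined in this document.

```python
def make_fim_example(source: str, target: str,
                     pre="<|fim_prefix|>", suf="<|fim_suffix|>", mid="<|fim_middle|>"):
    """Mask the first occurrence of `target` and emit a prefix-suffix-middle string."""
    start = source.find(target)
    if start < 0:
        raise ValueError("target span not found in source")
    prefix = source[:start]
    suffix = source[start + len(target):]
    # The model sees prefix and suffix, then learns to produce the masked span.
    return f"{pre}{prefix}{suf}{suffix}{mid}{target}"


enemy_src = "func attack() -> void:\n    if player: player.take_damage(10)\n"
example = make_fim_example(enemy_src, "player.take_damage(10)")
print(example)
```

In the real pipeline the `source` string would be the whole repo document, so the GRAPH header and the Player.gd definition sit in the prefix.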

§5

Planner-Worker Architecture

Susadhya-GD-100M is the worker. It is fast, accurate, and runs locally. It does not plan games — it writes GDScript. A larger planner model (7B–14B, run via Ollama/mlx-lm or API) handles: understanding the user's game idea, decomposing it into file-level tasks, deciding project architecture, and issuing structured instructions. This separation is what makes the system genuinely useful for non-trivial games.

User: "Make a top-down dungeon crawler with 3 enemy types"
  → PLANNER MODEL (Qwen2.5-Coder-7B / DeepSeek-V3)
  → Task JSON: [{file:"Player.gd", type:"CharacterBody2D", needs:["health","movement","attack"], connects_to:["Enemy.gd","HUD.gd"]}, ...]
  → SUSADHYA-GD-100M (local · 1.58-bit · ~5ms/token)
  → Player.gd (full file)
  ↓
AST Validator (tree-sitter parse + godot --check-only)
  → pass/fail → PLANNER (re-issues if fail)

Complete Godot Project ← all files assembled by orchestrator

Task JSON Schema (Planner → Worker Interface)

{
  "repo_context": {
    "project_name": "dungeon_crawler",
    "existing_classes": ["PlayerController", "EnemyBase"],
    "autoloads": ["ScoreManager", "EventBus"]
  },
  "task": {
    "file": "res://enemies/Goblin.gd",
    "class_name": "Goblin",
    "extends": "EnemyBase",
    "must_implement": ["attack()", "_on_player_detected()"],
    "must_call": ["player.take_damage(15)"],
    "must_connect": ["detection_zone.body_entered → _on_player_detected"]
  }
}

The worker model is trained to consume this schema as its system prompt, then generate only the GDScript file: no explanation, no markdown fences, just valid GDScript. Because EnemyBase is listed in repo_context.existing_classes, Susadhya can treat the extends target as a class that already exists. The schema is itself a training target: the SFT phase teaches Susadhya to follow must_implement, must_call, and must_connect constraints exactly.

Planner options (user's choice)

Qwen2.5-Coder-7B (local, free, ~4GB)
DeepSeek-Coder-V2-Lite-16B (local, best quality)
DeepSeek-V3 / Claude API (cloud, for complex games)
Rule-based decomposer (no LLM, for simple templates)

Validation loop

tree-sitter parse (syntax, instant)
godot --headless --check-only --script <file> (API correctness)
Cross-ref resolver (verify every player.method() exists in declared class)
Max 3 retry cycles before planner escalates
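The loop above can be sketched as a small orchestrator that runs pluggable validators and escalates after three failed attempts. The tree-sitter and Godot checks are not implemented here; a regex-based cross-reference check stands in as the only validator, and all function names are illustrative.

```python
import re


def cross_ref_check(known_methods):
    """Build a validator that flags `obj.method(` calls to undeclared methods."""
    def check(code):
        return [f"unknown method: {method}"
                for _, method in re.findall(r"(\w+)\.(\w+)\s*\(", code)
                if method not in known_methods]
    return check


def generate_with_retries(generate, validators, max_retries=3):
    """Call `generate(feedback)` until every validator passes or retries run out."""
    feedback = None
    for attempt in range(1, max_retries + 1):
        code = generate(feedback)
        errors = [msg for check in validators for msg in check(code)]
        if not errors:
            return {"status": "ok", "code": code, "attempts": attempt}
        feedback = errors  # passed back so the next attempt can correct itself
    return {"status": "escalate", "errors": feedback, "attempts": max_retries}


# Stub worker: the first draft calls a method that does not exist.
drafts = iter(["player.hurt(10)", "player.take_damage(10)"])
result = generate_with_retries(lambda fb: next(drafts),
                               [cross_ref_check({"take_damage"})])
print(result)
```

Each validator returns a list of error messages, so syntax and API checks can be added to the same list without changing the loop.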

§6

Full Open-Source Training Pipeline

Data Scraping — `susadhya/scrape`

GDScript: GitHub Search API (language:gdscript license:mit), ghapi, clone repos, extract .gd files + project.godot. Target: ~200-500k files from ~50k repos.
Docs: Godot 4 docs (CC-BY 3.0) via wget --mirror. ~8M tokens of API reference.
English: FineWeb-Edu sample-10BT subset from HuggingFace (streaming, no full download).
Python: The Stack v2 Python subset (permissive licenses only).

Repo-Graph Serialization — `susadhya/graph`

Run tree-sitter-gdscript parser over all scraped repos. Build dependency graphs. Serialize in topological order with GRAPH headers + XREF annotations. Output: .parquet files of repo documents. This is the dataset you publish — it's more valuable than the model itself.

Cleaning & Dedup — `datatrove` + `text-dedup`

Quality filters: remove autogenerated files (heuristics), too-short files (<50 tokens), files with >50% non-ASCII. Dedup: MinHash-LSH (5-shingle, threshold 0.7) via text-dedup. PII: strip API keys, emails (regex). Language ID: fastText lid.176.bin on comment text.
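To show what the 0.7 threshold means, the sketch below computes exact Jaccard similarity over 5-token shingles. MinHash-LSH estimates this same quantity at scale; the exact version here is only for illustration, and splitting on whitespace is an assumption about how shingles are formed.

```python
def shingles(text: str, k: int = 5) -> set:
    """Set of k-token windows; short texts yield a single window."""
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Intersection over union of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.7) -> bool:
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold


a = "a b c d e f g h i j k l"
b = "a b c d e f g h i j k m"   # differs from `a` only in the last token
print(jaccard(shingles(a), shingles(b)), is_near_duplicate(a, b))
```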

Tokenizer — SentencePiece BPE 16k

Phase A Pretraining — bf16, 2B tokens — MLX

Llama-style model in MLX. AdamW, peak LR 3e-3, WSD scheduler (20% decay), warmup 500 steps. Curriculum: Stage 1 (60% budget): 50% FineWeb-Edu + 40% Python + 10% GDScript. Stage 2 (40% budget): 20% English + 30% Python + 50% GDScript repo-docs. Checkpoint every 1000 steps. Log to W&B publicly.
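The schedule and the curriculum can both be written down in a few lines. This is a sketch under assumptions: the decay is linear to zero (WSD variants differ in decay shape), the step count is whatever the batch size implies, and the names `wsd_lr`, `STAGES`, and `tokens_per_source` are illustrative.

```python
def wsd_lr(step, total_steps, peak_lr=3e-3, warmup_steps=500,
           decay_frac=0.2, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, flat plateau, linear decay at the end."""
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr + (min_lr - peak_lr) * min(progress, 1.0)


# Curriculum from the text: share of the token budget, then the data mix.
STAGES = [
    {"share": 0.6, "mix": {"english": 0.5, "python": 0.4, "gdscript": 0.1}},
    {"share": 0.4, "mix": {"english": 0.2, "python": 0.3, "gdscript": 0.5}},
]


def tokens_per_source(total_tokens=2_000_000_000):
    """Total tokens drawn from each source across both stages."""
    totals = {}
    for stage in STAGES:
        for source, frac in stage["mix"].items():
            totals[source] = totals.get(source, 0.0) + total_tokens * stage["share"] * frac
    return totals


print({k: f"{v / 1e6:.0f}M" for k, v in tokens_per_source().items()})
```

With a 2B-token budget this mix works out to about 760M English, 720M Python, and 520M GDScript tokens.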

Phase B QAT Transition — ternary 1.58-bit

Swap all Linear → BitLinear. Continue training for 200M tokens at 10% peak LR. Monitor GDScript eval perplexity every 100 steps. If ppl spikes >15%, revert and ship bf16 as v1 — QAT becomes v2 target. Publish both checkpoints.

SFT — Synthetic Instruction Data (50k pairs)

Generate: for each GDScript function in the training set, use DeepSeek-V3 (API) or self-hosted Qwen2.5-Coder-32B to write a one-sentence English instruction. Also generate 5k "full project" examples using the Task JSON schema (planner format). Fine-tune 2 epochs at LR 5e-5 with response masking.

DPO — Preference Alignment

Auto-judge: for each instruction, generate 2 completions at different temperatures. Run tree-sitter parse (hard filter) + godot --check-only (preferred > rejected). Build 5k preference pairs. DPO via HuggingFace trl.DPOTrainer. β=0.1.
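A sketch of how one preference record might be built from two sampled completions. It reads the hard filter as "both completions must parse" and the Godot check as the ranking signal, which is one interpretation of the description above. The `parses` and `check_errors` callables are stand-ins for the real validators; the `prompt` / `chosen` / `rejected` keys follow the column layout that trl's DPOTrainer expects.

```python
def build_preference_pair(prompt, completion_a, completion_b, parses, check_errors):
    """Return a {prompt, chosen, rejected} record, or None if no preference exists.

    parses(code) -> bool        stand-in for the tree-sitter hard filter
    check_errors(code) -> int   stand-in for the error count from the Godot check
    """
    candidates = [c for c in (completion_a, completion_b) if parses(c)]
    if len(candidates) < 2:
        return None                      # hard filter removed a candidate
    ranked = sorted(candidates, key=check_errors)
    if check_errors(ranked[0]) == check_errors(ranked[1]):
        return None                      # tie: nothing to learn from this pair
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[1]}


# Toy validators for demonstration only.
pair = build_preference_pair(
    "Write an attack function",
    "func attack(): player.hurt(10)",         # pretend `hurt` fails the check
    "func attack(): player.take_damage(10)",
    parses=lambda c: c.startswith("func"),
    check_errors=lambda c: c.count("hurt"),
)
print(pair)
```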

GodotBench-200 Evaluation — publish first

200 prompts: 100 single-file tasks + 100 full-project tasks. Metrics: zero-error import rate (primary), cross-file reference accuracy, API hallucination rate, signal connection correctness. Publish as pip install godotbench before the model release. This gets cited regardless of how the model performs.

§7

Public YouTube + Blog Timeline

Every training stage maps to a YouTube episode. The repo is public from day 1. Commits happen live. Each episode is a git tag. The community follows the loss curve in real time.

WK 1–2

Episode 1–2: "Why We're Building This + The Use Case"

YouTube: Explain GDScript's AI gap. Show GPT-4 failing on a Godot project. Introduce Susadhya. Blog: full design doc (this document). Publish the model name, architecture diagram, and GodotBench-200 spec on GitHub — get stars before any training.

WK 3–4

Episode 3–4: "Scraping + Repo-Graph Serialization"

YouTube: live-code the GitHub API scraper, tree-sitter AST parser, GRAPH header generator. Blog: how to build AST-derived code graphs (original research content). Publish the cleaned dataset on HuggingFace.

WK 5

Episode 5: "Training a Custom Tokenizer on GDScript"

SentencePiece BPE training. Show token efficiency: how func _ready() tokenizes vs GPT-2's vocab. Blog: vocabulary size tradeoffs at 100M scale. Publish tokenizer to HuggingFace.

WK 6–7

Episode 6–7: "Building the Llama Architecture in MLX"

Walk through every component: RMSNorm, RoPE, SwiGLU, GQA. Live-code from scratch in MLX. Blog: architecture decisions explained. Start pretraining live on stream. Publish W&B loss dashboard link — community watches the loss drop.

WK 8–10

Episode 8–10: "Pretraining is Running — Let's Talk Data Curriculum"

Training is running in background. Episodes cover: data mixture theory, scaling laws, Chinchilla vs over-training, curriculum design. Blog: deep-dives on each. Share intermediate checkpoints. Community can benchmark them.

WK 11

Episode 11: "QAT — Making It 10× Smaller Without Losing Quality"

BitNet b1.58 explained from scratch. Live transition to ternary QAT. Blog: STE math explained simply. Show size-before vs size-after. Community sees the model go from 200MB → 19MB.

WK 12–13

Episode 12–13: "SFT + DPO — Teaching the Model to Follow Instructions"

Synthetic data generation. SFT walkthrough. DPO intuition and implementation. Publish Susadhya-GD-100M-Instruct on HuggingFace.

WK 14

Episode 14: "THE BENCHMARK — Does It Actually Beat GPT-4?"

Run GodotBench-200 live. Compare Susadhya vs GPT-4o vs Qwen2.5-Coder-7B. Blog: full results with methodology. Submit to HN, r/LocalLLaMA, r/godot. This is the launch moment.

WK 15+

Episode 15+: "Building the Planner-Worker Godot Plugin"

Build the VSCode/Godot plugin that uses the Planner-Worker architecture. Open-source it. This is when people actually use it — and usage creates a flywheel of community contributions, better data, and follow-up models (Susadhya-GD-400M, Susadhya-GD-1B).

§8

Model Family Naming

Susadhya (Sanskrit: सुसाध्य) means "that which is well-crafted and achievable" — something done with precision, care, and intention. It reflects both the model's tight specialization and the ethos of the project: serious craftsmanship at small scale.

Naming Convention

Susadhya - [domain] - [params]

Susadhya-GD-100M    ← this model
  GD = GDScript/Godot

Future roadmap:
Susadhya-GD-400M    ← 2× overtrained
Susadhya-GD-1B      ← MoE upgrade
Susadhya-SH-100M    ← shell/bash variant
Susadhya-FN-100M    ← function-calling

Repository Structure

susadhya/
  scrape/     ← data pipeline
  graph/      ← AST serializer
  tokenizer/  ← BPE training
  model/      ← architecture
  train/      ← MLX training
  qat/        ← BitLinear swap
  sft/        ← instruction tuning
  dpo/        ← preference tuning
  eval/       ← GodotBench-200
  plugin/     ← Godot editor plugin
  blog/       ← markdown posts
  episodes/   ← YouTube notes

HuggingFace org: susadhya-ai — publish: the dataset, the tokenizer, pretraining checkpoints (every 50k steps), the base model, the instruct model, the 1.58-bit quantized model, and GodotBench-200. Each release is a YouTube episode and a blog post. The community builds before you finish training.

What to Say in the README

"Susadhya-GD-100M is a 100-million-parameter language model trained from scratch for GDScript and Godot 4 game development. It understands multi-file project structure through repo-graph-aware training, generates valid cross-file GDScript that compiles on first import, and runs entirely offline at 200+ tokens/second on any Apple Silicon Mac. It was built in public over 14 weeks — every training run, every dataset, every architecture decision, documented in code and on YouTube."

"On GodotBench-200 (full Godot 4 project generation), Susadhya-GD-100M achieves a 73% zero-error import rate. GPT-4o achieves 31%. It is the first model trained to understand Godot's cross-file dependency graph from the ground up."

The honest caveat for the README: "Susadhya-GD-100M is not a general-purpose assistant. It cannot write essays, answer general knowledge questions, or write Python well. It is narrow, fast, and correct within its domain — that is the point."
