
Porting the First MoE Image Model (17B) to Apple Silicon

Nucleus-Image packs 17B parameters into a model that only activates 2B per token. Porting it to MLX meant implementing expert-choice routing and CausalConv3d, then fixing 13 bugs between black output and photorealism.

by Ritesh Khanna | @treadon

Nucleus-Image is a 17 billion parameter text-to-image model, and the first open MoE image model I’m aware of. It uses Mixture-of-Experts to pack 17B parameters into a model that only activates ~2B per token. The idea: get the quality of a large model at the cost of a small one.

There’s no MLX port for it. Most MLX image work targets dense models like FLUX or Stable Diffusion. MoE image models are new, and porting one means solving problems nobody has solved before in MLX: expert-choice routing, packed expert weights, capacity-based token dispatch, and a VAE that uses CausalConv3d.

This is the story of porting it. It took 13 bugs to get from black output to photorealistic images.

Why MoE for Images?

Dense models use every parameter for every token. MoE models route each token to a small subset of “expert” subnetworks. This means you can scale parameters without scaling compute:

[Chart: total vs. active parameters]

Nucleus-Image has 17B total parameters but activates only ~2B per token, fewer active parameters than ERNIE's 8B dense model.

The trade-off: MoE models need more memory (all 17B parameters must be loaded) but less compute per forward pass. This is perfect for Apple Silicon, where unified memory is plentiful (64-128GB) but GPU compute is limited. MLX’s 4-bit quantization makes the memory manageable.

The Pipeline

[Pipeline diagram]

The text encoder stays in PyTorch. Qwen3-VL-8B uses a complex architecture with vision-language features that would take weeks to reimplement in MLX, and it runs in ~2 seconds. Not worth porting. Everything else runs in MLX.

The DiT is the interesting part. Each of the 29 MoE layers has 64 routed experts and 1 shared expert. The routing is “expert-choice”: instead of each token picking its top-2 experts (like GPT-4), each expert picks its top-C tokens. C is the “capacity”: how many tokens each expert can handle, determined by a capacity factor from the config.
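The dispatch logic can be sketched in a few lines. Here is a minimal NumPy stand-in for expert-choice routing; the shapes and the capacity formula follow the generic expert-choice scheme, not Nucleus-Image's exact config:

```python
import numpy as np

def expert_choice_route(router_logits, capacity_factor=2.0):
    """router_logits: (num_tokens, num_experts) scores from the router."""
    num_tokens, num_experts = router_logits.shape
    # Capacity C: how many tokens each expert handles per forward pass.
    C = int(capacity_factor * num_tokens / num_experts)
    # Softmax over experts for each token (routing probabilities).
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Expert-choice: transpose so each EXPERT picks its top-C tokens by score.
    scores = probs.T                              # (num_experts, num_tokens)
    top_c = np.argsort(-scores, axis=-1)[:, :C]   # token indices per expert
    gates = np.take_along_axis(scores, top_c, axis=-1)
    return top_c, gates                           # dispatch indices and weights

rng = np.random.default_rng(0)
idx, gates = expert_choice_route(rng.normal(size=(16, 4)), capacity_factor=2.0)
print(idx.shape)  # (4, 8): each of 4 experts takes C = 2.0 * 16 / 4 = 8 tokens
```

Note the contrast with token-choice (top-k) routing: here a token can be picked by zero experts or by several, and the per-expert workload is fixed at C by construction.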

The Port: Weight Loading

Step one is always the same: define the MLX architecture so that every layer name matches the PyTorch weight names exactly. If names match, weights load with zero mapping code.

- DiT (17B): 610/610 tensors matched
- VAE (254MB): 108/108 tensors matched
- Total: 718 weight tensors across 7 safetensors shards (~34GB)
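The name check itself is trivial once the architecture mirrors the checkpoint. A hypothetical sketch (the key strings below are illustrative, not Nucleus-Image's actual parameter names):

```python
def check_weight_names(checkpoint_keys, model_keys):
    # Keys the model expects but the checkpoint lacks, and vice versa.
    missing = sorted(set(model_keys) - set(checkpoint_keys))
    unexpected = sorted(set(checkpoint_keys) - set(model_keys))
    return missing, unexpected

# Illustrative keys only; the real model has 718 of these.
ckpt = {"blocks.0.attn.q_proj.weight", "blocks.0.moe.router.weight"}
model = {"blocks.0.attn.q_proj.weight", "blocks.0.moe.router.weight"}
missing, unexpected = check_weight_names(ckpt, model)
print(missing, unexpected)  # [] [] -> a "610/610 PERFECT" style match
```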

Weights loaded perfectly. The output was black.

13 Bugs to Photorealism

Getting weights to load is the easy part. Getting the model to produce correct output took finding and fixing 13 separate bugs across three debugging sessions. Each fix moved the output from black → gray → noisy color → over-saturated → photorealistic.

The hardest bug was the VAE. The original model uses CausalConv3d for video support. For single-frame images, you’d think only the center temporal slice of the 3D kernel matters. Wrong. Causal convolutions pad before the frame, not symmetrically. The padding is (2p, 0), which means the input is [0, 0, x] and only the last kernel slice fires. The center slice had 20× less weight energy. The VAE was producing near-zero output for any input.
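The effect is easy to reproduce in one temporal dimension. A toy sketch with made-up kernel weights (the 20× energy ratio mirrors the bug described above; the real CausalConv3d pads a full 3D kernel the same way along time):

```python
import numpy as np

# Kernel slices along time: [t-2, t-1, t]. The LAST slice holds the energy.
k = np.array([0.1, 0.25, 5.0])
x = np.array([1.0])  # a single frame (T=1)

# Causal padding: zeros BEFORE the frame, so the window is [0, 0, x].
causal_window = np.concatenate([np.zeros(2), x])
causal_out = float((k * causal_window).sum())  # only k[2] fires: 5.0

# Naive "center slice" assumption: symmetric padding, window [0, x, 0].
center_window = np.array([0.0, x[0], 0.0])
center_out = float((k * center_window).sum())  # only k[1] fires: 0.25, 20x too small
```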

The most surprising bug was the negative embeddings. Classifier-free guidance (CFG) works by computing neg + scale × (pos - neg). The reference encodes an empty string as the negative, which produces a non-zero embedding with L2 norm of 15,824. We used zero vectors. That’s not “no guidance.” It’s guidance in a completely wrong direction. The result was over-saturated images with cyan fringing at every edge.
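The failure mode is visible even with toy vectors. A sketch of the CFG formula (the values are illustrative stand-ins, not real embeddings):

```python
import numpy as np

def cfg(pos, neg, scale):
    # Classifier-free guidance: extrapolate from the negative toward the positive.
    return neg + scale * (pos - neg)

pos = np.array([2.0, -1.0])         # conditional prediction (toy)
neg_empty = np.array([0.5, 0.3])    # stand-in for the encoded empty string: non-zero
neg_zero = np.zeros(2)

correct = cfg(pos, neg_empty, 4.0)  # guides away from the "empty prompt" direction
wrong = cfg(pos, neg_zero, 4.0)     # collapses to scale * pos: raw signal amplified 4x
```

With a zero negative, the formula degenerates to `scale * pos`, which amplifies everything in the conditional output rather than the prompt-specific difference, hence the over-saturation.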

Debugging Method: Drop-In Testing

The breakthrough debugging technique: start with the fully working reference (diffusers) pipeline and swap in one MLX component at a time until quality degrades.

1. Reference latents → MLX VAE: pixel-perfect match. The VAE is correct.
2. Reference tokens → MLX unpatchify + denorm + VAE: pixel-perfect match. Post-processing is correct.
3. PyTorch DiT → our scheduler/CFG: still over-saturated! The bug is in the scheduler or CFG, not the DiT.

This immediately ruled out the DiT as the source of the quality issues and pointed directly to the sigma schedule (no shift needed) and negative embeddings (encode empty string, not zeros).
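The technique reduces to a tiny harness. A sketch with toy stage functions standing in for the real swaps (VAE, scheduler, etc.); the threshold and function names are hypothetical:

```python
import numpy as np

def correlation(a, b):
    # Pearson correlation between flattened outputs.
    return float(np.corrcoef(np.ravel(a), np.ravel(b))[0, 1])

def check_stage(name, ref_fn, port_fn, x, min_corr=0.999):
    # Run the reference and the ported component on the same input and compare.
    c = correlation(ref_fn(x), port_fn(x))
    print(f"{name}: corr={c:.4f} {'OK' if c >= min_corr else 'DEGRADED'}")
    return c

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
c_ok = check_stage("vae", lambda t: t * 2, lambda t: t * 2 + 1e-6, x)
c_bad = check_stage("scheduler", lambda t: t, lambda t: t + rng.normal(size=t.shape), x)
```

The first swap passes, the second degrades, so the bug hunt narrows to one component per run.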

Precision and MoE

Even with all bugs fixed, there’s a subtle quality difference from the reference. The reason: precision compounding across 29 MoE layers.

[Chart: output correlation vs. reference, by block]

Dense blocks (0-2) match at 0.9999 correlation. Each MoE block drops another 1-2%, compounding to ~0.75 by block 31.

Each MoE block introduces ~1% error from bfloat16 precision differences in the routing decisions. 96% of token-to-expert assignments match the reference exactly. The 4% that differ come from slightly different softmax scores causing different top-C selections. Over 29 blocks, this compounds to ~0.75 overall correlation.
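The arithmetic checks out: treating each MoE block's ~1% mismatch as an independent, multiplicative correlation loss gives:

```python
per_block = 0.99    # ~1% correlation loss per MoE block
moe_blocks = 29
overall = per_block ** moe_blocks
print(round(overall, 3))  # 0.747, matching the measured ~0.75
```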

In practice, the output is visually identical to the reference. The precision difference shows up as a slightly different “interpretation” of the prompt, not as artifacts or degradation.

Performance

[Benchmark: 512×512, 20 steps, M4 Pro 64GB]

bf16 is ~12% faster than 4-bit. The memory savings of 4-bit matter more on smaller machines.

Unlike LLM inference, where quantization usually speeds things up, 4-bit is slightly slower than bf16 here. The dequantization overhead on the attention and modulation projections outweighs the memory bandwidth savings (the expert weights, which are the bulk of the model, stay in bf16 regardless).

So why use 4-bit at all? Memory. The full bf16 model takes ~34GB of RAM just for weights. Add the text encoder (~16GB) and VAE (~1GB) and you’re at 50GB. On a 64GB machine this fits but barely. 4-bit quantization brings the DiT down to ~8GB, total footprint ~25GB. That’s the difference between “runs on my 32GB laptop” and “needs a 64GB machine.” MLX’s built-in nn.quantize() makes this a one-line change.
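The back-of-envelope math behind those numbers (assuming the full DiT is quantized to 4 bits and ignoring quantization scale/bias overhead):

```python
params = 17e9                   # DiT parameter count
bf16_gb = params * 2.0 / 1e9    # 2 bytes per parameter -> 34 GB
int4_gb = params * 0.5 / 1e9    # 0.5 bytes per parameter -> ~8.5 GB
total_bf16 = bf16_gb + 16 + 1   # + text encoder (~16 GB) + VAE (~1 GB): ~50 GB
total_int4 = int4_gb + 16 + 1   # ~25 GB total footprint
```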

The Results

512×512, 30 steps, CFG 4.0, 4-bit quantized on M4 Pro:

[Image gallery: sample outputs]

- "A red apple on a white table"
- "Golden retriever puppy in autumn leaves"
- "Coffee on a rainy windowsill"

What I Learned

  • MoE image models are different from MoE language models. Expert-choice routing (experts pick tokens) vs token-choice (tokens pick experts) changes the entire dispatch implementation. And the SwiGLU split convention differs between dense FFN and routed experts in the same model.
  • CausalConv3d is not Conv3d with center slice. Causal padding is one-sided. For T=1, the last kernel slice fires, not the center. This was a 20× magnitude error that made the VAE look broken.
  • Drop-in debugging is powerful. Swapping one component at a time between reference and port immediately isolates where quality degrades. It found two bugs in 10 minutes that I’d spent hours on.
  • Negative embeddings matter enormously for CFG. Zero vectors are not “unconditional.” The encoded empty string has L2 norm of 15,824. Using zeros makes CFG amplify the raw signal instead of the prompt-specific signal.
  • Precision compounds in MoE. Each MoE block introduces ~1% error from routing differences. Over 29 blocks, this compounds. Dense blocks are nearly exact (0.9999 correlation). The MoE routing’s sensitivity to softmax precision is the bottleneck.