Benchmark · Aeryx

lmkit-go vs lmkit: pure Go on XLA against PyTorch, same 8GB card
Jun 25, 2026
Built an LLM trainer in pure Go on XLA to see if it could keep up with the PyTorch original. Ran both to a 2B-token Chinchilla budget on the same 8GB 3070 Ti, with the same 100M model, config, tokens, and seed. Go-on-XLA lands at 85% of PyTorch's throughput while using ~1.7x the memory, and the two loss curves stay within 0.1 of each other the whole way. Three weeks ago the same stack was 26x slower and couldn’t fit the model at all.
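
For readers wondering where the 2B-token figure comes from, here is a minimal sketch of the arithmetic. It assumes the commonly quoted Chinchilla rule of thumb of roughly 20 training tokens per parameter; the constant and the function name `chinchillaTokens` are illustrative and not taken from either trainer.

```go
package main

import "fmt"

// tokensPerParam is the widely used Chinchilla rule of thumb (about 20
// training tokens per model parameter). Assumption: the post does not
// state the ratio it used; 20 is simply consistent with 100M -> 2B.
const tokensPerParam = 20

// chinchillaTokens returns the approximate compute-optimal token budget
// for a model with the given number of parameters.
func chinchillaTokens(params int64) int64 {
	return params * tokensPerParam
}

func main() {
	const params = 100_000_000 // the 100M-parameter model from the post
	fmt.Println(chinchillaTokens(params)) // prints 2000000000
}
```

With the 100M-parameter model this gives 2,000,000,000 tokens, matching the budget both runs were trained to.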