ScalarLM v1.0
The Training Loop Is the Product: ScalarLM 1.0 for Researchers Working Beyond the LLM Frontier
Most ML infrastructure is optimized for a world that cutting-edge research has already left behind. Serve a fixed model. Train a new one offline. Repeat. But if your work involves online learning, RLHF, self-improving agents, hybrid architectures, or any system where the line between inference and training is meant to dissolve, that separation is the problem, not the solution.
ScalarLM 1.0 was built to close that loop.
One Deployment, Both Directions
ScalarLM unifies inference and training into a single deployment. A live model exposes an OpenAI-compatible endpoint backed by vLLM's PagedAttention and continuous batching. That same deployment accepts training jobs dispatched via a Slurm scheduler running inside Kubernetes, powered by Megatron-LM's tensor and pipeline parallelism. The seam between them is a shared checkpoint: Megatron writes to it, vLLM loads from it — automatically, at the next inference request, with no restart.
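Because the serving side is OpenAI-compatible, any standard HTTP client can talk to a running deployment. The sketch below builds a chat-completion request against an assumed local endpoint; the base URL, port, and model name are placeholders to substitute with your deployment's values, and sending the request of course requires a live server.

```python
# Minimal sketch of querying a ScalarLM deployment through its
# OpenAI-compatible endpoint. BASE_URL and the model name are
# assumptions -- substitute the values from your deployment.
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed local deployment

def build_chat_request(prompt: str, model: str = "default") -> dict:
    """Build a standard OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def query(prompt: str, model: str = "default") -> str:
    """POST the payload to the deployment (requires a running server)."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Since the endpoint speaks the OpenAI wire format, existing client libraries and tooling work unchanged.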
This isn't a workflow convenience. It's an architectural primitive. If you're studying self-improvement dynamics, building online RLHF pipelines, or investigating how a model's output distribution shifts under repeated post-training, ScalarLM lets you close that loop experimentally — not theoretically.
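The shape of such a closed loop can be sketched abstractly. In the skeleton below, `generate`, `score`, and `submit_training_job` are hypothetical placeholders, not ScalarLM's actual client API; the point is only the control flow: sample from the live model, filter, post-train, and let the refreshed checkpoint feed the next iteration.

```python
# Hypothetical sketch of one iteration of a self-improvement loop
# on a single deployment. All three callables are placeholders.
def improvement_step(prompts, generate, score, submit_training_job,
                     threshold: float = 0.5):
    """Sample from the live model, keep high-scoring outputs, post-train."""
    samples = [(p, generate(p)) for p in prompts]
    accepted = [(p, y) for p, y in samples if score(p, y) > threshold]
    if accepted:
        # In ScalarLM's design, the training job writes a checkpoint
        # that the inference side picks up at the next request.
        submit_training_job(accepted)
    return accepted
```

Each call to `improvement_step` queries a model that already reflects the previous iteration's training, which is exactly the dynamic the shared-checkpoint design makes observable.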
Why This Matters If You're Not Working on LLMs
The platform's name is narrower than its design. The supported model table already signals the direction: Qwen3.5's Gated Delta Networks, NVIDIA's Nemotron 3 Super with its hybrid Mamba-Transformer MoE architecture, and OpenAI's GPT-OSS sparse MoE models are all first-class citizens. The inference backend supports MoE natively on both NVIDIA and AMD hardware. The 1M-context Nemotron 3 Super is optimized for multi-agent workloads — a setting where the boundary between a model's past outputs and its future inputs is structurally meaningful.
For researchers in state space models, sparse architectures, or long-context regimes, the relevant question isn't "does ScalarLM support my model class?" It's "does the training loop I'm building require live feedback from a running model?" If yes, ScalarLM provides the scaffolding.
The Stack Is Yours
ScalarLM ships CC0. No attribution, no licensing friction, no vendor lock-in of any kind. Fork it, publish with it, embed it in a product. More practically for researchers: the training pipeline runs from a local ml/ directory that is packaged and shipped to the cluster automatically at job submission. Modify the training loop, loss function, optimizer, or data loader in your editor — the cluster picks it up without a Docker rebuild or redeployment. Every layer is visible and editable with normal tools and version control.
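To make the editing workflow concrete, here is the kind of small, plain-Python change you might drop into your local ml/ directory. The function name and where it hooks in are assumptions for illustration; the point is that it is ordinary code, edited locally and shipped to the cluster at job submission with no container rebuild.

```python
# Illustrative only: a label-smoothed per-token loss you might edit
# in your local ml/ tree. Hook names and file layout are assumptions;
# this shows the kind of change that needs no Docker rebuild.
import math

def label_smoothed_nll(logprob_correct: float, logprob_mean: float,
                       smoothing: float = 0.1) -> float:
    """Label-smoothed negative log-likelihood for a single token.

    Mixes the log-probability of the correct token with the mean
    log-probability over the vocabulary, weighted by `smoothing`.
    """
    return -((1.0 - smoothing) * logprob_correct + smoothing * logprob_mean)
```

Version-control the change like any other source file; the packaging step at job submission carries it to the cluster.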
There are no black boxes. That's not a slogan; it's a design constraint.
Scaling Numbers That Hold at Production Sizes
ScalarLM achieves near-perfect weak scaling across GPU counts: 97.6% efficiency at 32 GPUs, 96.8% at 64, 95.2% at 128, and 94.2% at 256 — delivering approximately 241× effective throughput against a theoretical ideal of 256×. This is not a synthetic benchmark. These numbers reflect real multi-node Kubernetes deployments running production workloads.
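The arithmetic behind these figures is simple to check: weak-scaling efficiency times GPU count gives effective throughput in ideal-GPU equivalents, and at 256 GPUs the quoted 94.2% works out to roughly the 241× figure above.

```python
# Checking the scaling arithmetic from the text: efficiency x GPU
# count = effective throughput in ideal-GPU equivalents.
def effective_throughput(gpus: int, efficiency: float) -> float:
    return gpus * efficiency

# Efficiency figures quoted in the text.
points = {32: 0.976, 64: 0.968, 128: 0.952, 256: 0.942}
for gpus, eff in points.items():
    print(gpus, round(effective_throughput(gpus, eff), 1))
# 256 GPUs at 94.2% -> ~241.2 effective GPUs, matching the ~241x claim
```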
The platform runs identically on NVIDIA Turing through Blackwell, AMD MI300X, ARM (including M2 MacBooks for local development), and x86 CPU — with zero code changes across targets. Deployments on TensorWave's AMD infrastructure use the same Helm charts as NVIDIA A100 clusters. Your experimental results do not get tied to a single vendor's hardware availability or pricing.
Getting Started
git clone https://github.com/supermassive-intelligence/scalarlm.git
cd scalarlm
./scalarlm up
Prebuilt containers are available for every supported target. Cloud deployments run on TensorWave. Enterprise support and production integration are available through RelationalAI. Full documentation, architecture diagrams, and deployment guides live at scalarlm.com.