Introduction

ScalarLM is an open-source platform for closed-loop LLM experimentation — running a model and post-training it against live feedback in the same deployment, across GPU hardware from any vendor, at scales from a single workstation to multi-node Kubernetes clusters.

It is CC0-licensed. You can fork it, publish with it, build on it, and ship it without restriction or attribution.


Why ScalarLM

Most training and inference infrastructure is designed for one direction of work: either serving a fixed model, or producing a new one. Closing the loop — using a live model's outputs to drive the next round of post-training — typically requires stitching together separate systems with different APIs, checkpoint formats, and scheduling assumptions.

ScalarLM is built around that loop as the primary use case. A single deployment exposes both an inference endpoint and a training endpoint. You can query the running model, construct training signal from the results, and submit a post-training job against the same deployment without touching infrastructure. The updated checkpoint is picked up automatically at the next inference request.
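A minimal sketch of that loop in Python. The OpenAI-compatible inference route is standard, but the training endpoint path and payload shape below are assumptions for illustration, not ScalarLM's documented API; check the Quick Start for the real client interface.

```python
import json

# Hypothetical endpoints -- /v1/chat/completions follows the OpenAI
# convention; the training route and job schema are assumptions.
INFERENCE_URL = "http://localhost:8000/v1/chat/completions"
TRAIN_URL = "http://localhost:8000/v1/train"

def build_training_example(prompt, completion, reward):
    """Turn one scored model output into a supervised training record."""
    return {"input": prompt, "output": completion, "weight": reward}

def build_training_job(scored_outputs, max_steps=100):
    """Package scored live outputs as a post-training job payload."""
    return {
        "dataset": [build_training_example(p, c, r) for p, c, r in scored_outputs],
        "max_steps": max_steps,
    }

# Example: score two live completions and construct the next training job.
scored = [
    ("What is 2+2?", "4", 1.0),
    ("What is 2+2?", "5", 0.0),
]
job = build_training_job(scored)
print(json.dumps(job, indent=2))
# A real loop would POST `job` to TRAIN_URL (e.g. with requests); the
# updated checkpoint is then picked up at the next inference request.
```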

This makes it well-suited for research in online learning, RLHF pipelines, self-improvement, and any setting where the boundary between inference and training needs to be thin.


Architecture

ScalarLM composes three production-grade components, each responsible for a distinct part of the stack:

  • vLLM handles live inference with PagedAttention, continuous batching, and high token throughput. Each deployment exposes an OpenAI-compatible endpoint backed by vLLM.
  • Megatron-LM handles distributed training, providing tensor and pipeline parallelism for scaling across multiple GPUs and nodes. Training jobs are dispatched via a Slurm scheduler running inside the Kubernetes deployment.
  • Hugging Face Hub is the model source and, optionally, the model sink. Any Hub-hosted model can be deployed; post-training checkpoints can be pushed back to the Hub automatically at the end of a training run.

The seam between them is a shared checkpoint system. vLLM loads from a checkpoint; Megatron writes to one. The inference pod does not need to restart for the update to take effect.

A full architecture diagram is on the Architecture page.


GPU Agnostic

ScalarLM runs on NVIDIA and AMD GPUs without code changes. The training stack is built on PyTorch, and the inference stack inherits vLLM's hardware support. Deployments at TensorWave run on AMD MI300X hardware; the same Helm charts and ml/ directory work on NVIDIA A100 and H100 clusters.

This means your experimental results are not tied to a single vendor's hardware availability, pricing, or software ecosystem. If you develop on NVIDIA and need to scale on AMD — or the other way around — the platform does not get in the way.


Designed for Experimentation

The training pipeline runs from a local ml/ directory that is packaged and shipped to the cluster automatically with each job submission. This means you can modify the training loop, optimizer, loss function, or data loader locally — with your normal editor and version control — and the cluster picks it up without a Docker rebuild or a redeployment.
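To make the workflow concrete, here is a toy stand-in for the kind of edit you might make under ml/ -- a per-token-weighted negative log-likelihood, written in plain Python for illustration. The real training loop is PyTorch/Megatron-LM, and this function name and signature are hypothetical, not part of ScalarLM's code.

```python
import math

def weighted_nll(token_probs, weights):
    """Negative log-likelihood with per-token weights (e.g. reward-derived).

    token_probs: model probability assigned to each target token.
    weights: importance weight for each token position.
    """
    assert len(token_probs) == len(weights)
    total = sum(-w * math.log(p) for p, w in zip(token_probs, weights))
    return total / sum(weights)

# Two tokens, equal weight: the loss is the mean of -ln(0.5) and -ln(0.25).
loss = weighted_nll([0.5, 0.25], [1.0, 1.0])
print(round(loss, 4))
```

Because the ml/ directory ships with each job submission, an edit like this takes effect on the next run with no image rebuild.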

Current production deployments include Gemma 3 4B Instruct, Gemma 3 Embedding 300M, and Qwen2 32B Instruct, running on multi-GPU Kubernetes clusters with live inference endpoints. Any model available on the Hugging Face Hub can be deployed; see the table below for models that have been validated on ScalarLM.


Supported Models

ScalarLM deploys any Hugging Face-hosted model. The following have been validated and are ready to use by setting the model field in values.yaml:

| Model | Parameters | Architecture | Context | License | Notes |
|---|---|---|---|---|---|
| google/gemma-3-4b-it | 4B | Dense | 128K | Gemma ToU | Default deployment; production-tested |
| google/gemma-3-27b-it | 27B | Dense | 128K | Gemma ToU | |
| Qwen/Qwen2-32B-Instruct | 32B | Dense | 128K | Apache 2.0 | Production-tested |
| Qwen/Qwen3.5-35B-A3B | 35B total / 3B active | Hybrid MoE + Gated Delta Networks | 256K (1M via YaRN) | Apache 2.0 | Native thinking mode; multimodal |
| Qwen/Qwen3.5-122B-A10B | 122B total / 10B active | Hybrid MoE + Gated Delta Networks | 1M | Apache 2.0 | Requires multi-GPU; set inference_gpus accordingly |
| openai/gpt-oss-120b | 117B total / 5.1B active | MoE | 131K | Apache 2.0 | Fits a single 80GB GPU; verified on AMD |
| openai/gpt-oss-20b | 21B total / 3.6B active | MoE | 131K | Apache 2.0 | Fits a 16GB GPU; strong tool use and reasoning |
| nvidia/Nemotron-3-Super-120B | 120B total / 12B active | Hybrid Mamba-Transformer MoE | 1M | NVIDIA Open | Optimized for multi-agent workloads; NVFP4 recommended |
| EssentialAI/rnj-1-instruct | 8B | Dense (Gemma 3 variant) | 32K | Apache 2.0 | Strong agentic coding and STEM; designed for post-training |

Notes on multi-GPU deployments. Models above 30B active parameters typically require sharding across multiple GPUs. Set inference_gpus in values.yaml and enable sharding in the vLLM Helm chart. See the Kubernetes Deployment page for details.
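A minimal values.yaml fragment combining both settings named above. The model and inference_gpus fields come from this page; any other keys and the exact schema should be confirmed on the Kubernetes Deployment page.

```yaml
# values.yaml -- deploy a validated model from the table above, sharded
# across four GPUs per the multi-GPU note.
model: Qwen/Qwen2-32B-Instruct
inference_gpus: 4
```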

Notes on MoE models. Mixture-of-Experts models (Qwen3.5, gpt-oss, Nemotron 3 Super) activate only a fraction of total parameters per token, which means they run significantly faster and cheaper than their total parameter count implies. ScalarLM's vLLM inference backend supports MoE natively on both NVIDIA and AMD hardware.
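A quick back-of-the-envelope illustration of that point, using the gpt-oss-120b figures from the table above:

```python
# Per-token compute for an MoE model scales with active parameters,
# not total parameters. For gpt-oss-120b:
total_params = 117e9   # all expert weights (memory footprint)
active_params = 5.1e9  # parameters activated per token (compute)

active_fraction = active_params / total_params
print(f"{active_fraction:.1%} of parameters active per token")
# Per-token FLOPs therefore resemble a dense ~5.1B model, while the
# memory footprint still reflects all 117B weights.
```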


What It Is Not

ScalarLM is not a managed service or a training-as-a-service product. It is infrastructure you deploy and own. There is no scheduler-as-a-service, no auto-scaling, and no hosted model registry beyond what Hugging Face provides. If you want a fully managed experience, this is probably not the right tool. If you want to understand and control every layer of the stack, it is.


Get Started

| I want to... | Start here |
|---|---|
| Understand the full system design | Architecture |
| Run my first inference or training job | Quick Start |
| Customize the training loop | Custom Training |
| Deploy to my own Kubernetes cluster | Kubernetes Deployment |
| Read the source | GitHub |