Introduction
ScalarLM is an open-source platform for closed-loop LLM experimentation — running a model and post-training it against live feedback in the same deployment, across GPU hardware from any vendor, at scales from a single workstation to multi-node Kubernetes clusters.
It is CC-0 licensed. You can fork it, publish with it, build on it, and ship it without restriction or attribution.
Why ScalarLM
Most training and inference infrastructure is designed for one direction of work: either serving a fixed model, or producing a new one. Closing the loop — using a live model's outputs to drive the next round of post-training — typically requires stitching together separate systems with different APIs, checkpoint formats, and scheduling assumptions.
ScalarLM is built around that loop as the primary use case. A single deployment exposes both an inference endpoint and a training endpoint. You can query the running model, construct training signal from the results, and submit a post-training job against the same deployment without touching infrastructure. The updated checkpoint is picked up automatically at the next inference request.
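The loop above can be sketched in a few lines. The endpoint paths and payload shapes are illustrative assumptions, not ScalarLM's actual API; the core idea is just that inference results become training signal without leaving the deployment.

```python
# Sketch of one closed-loop iteration. In practice you would first query
# the deployment's OpenAI-compatible endpoint, score each completion,
# and then submit the filtered examples as a post-training job against
# the same deployment. The field names below are assumptions.

def build_training_examples(results, reward_threshold=0.5):
    """Keep completions whose reward clears the threshold, formatted as
    prompt/response pairs for a post-training job."""
    return [
        {"prompt": r["prompt"], "response": r["completion"]}
        for r in results
        if r["reward"] >= reward_threshold
    ]

# Feedback gathered from live inference (scored by some external signal):
feedback = [
    {"prompt": "2+2?", "completion": "4", "reward": 1.0},
    {"prompt": "2+2?", "completion": "5", "reward": 0.0},
]
examples = build_training_examples(feedback)
```

The filtered examples would then form the dataset for the next training job submission; the updated checkpoint is served automatically afterward.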
This makes it well-suited for research in online learning, RLHF pipelines, self-improvement, and any setting where the boundary between inference and training needs to be thin.
Architecture
ScalarLM composes three production-grade components, each responsible for a distinct part of the stack:
- vLLM handles live inference with PagedAttention, continuous batching, and high token throughput. Each deployment exposes an OpenAI-compatible endpoint backed by vLLM.
- Megatron-LM handles distributed training, providing tensor and pipeline parallelism for scaling across multiple GPUs and nodes. Training jobs are dispatched via a Slurm scheduler running inside the Kubernetes deployment.
- Hugging Face Hub is the model source and, optionally, the model sink. Any Hub-hosted model can be deployed; post-training checkpoints can be pushed back to the Hub automatically at the end of a training run.
The seam between them is a shared checkpoint store: Megatron writes checkpoints to it, and vLLM reloads from it, so the inference pod does not need to restart for an update to take effect.
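In effect, the handoff reduces to always resolving the newest checkpoint at load time. A minimal sketch, assuming a hypothetical `step_<N>` naming scheme (not ScalarLM's actual checkpoint layout):

```python
# Resolve the most recent checkpoint at the vLLM/Megatron seam.
# The "step_<N>" naming convention here is an illustrative assumption.

def latest_checkpoint(names):
    """Return the checkpoint name with the highest training step,
    or None if no checkpoints exist yet."""
    steps = [n for n in names if n.startswith("step_")]
    return max(steps, key=lambda n: int(n.split("_")[1])) if steps else None

# After a post-training run appends step_300, the next inference
# request resolves to it instead of step_200.
newest = latest_checkpoint(["step_100", "step_200", "step_300"])
```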
A full architecture diagram is on the Architecture page.
GPU Agnostic
ScalarLM runs on NVIDIA and AMD GPUs without code changes. The training stack is built on PyTorch, and the inference stack inherits vLLM's hardware support. Deployments at TensorWave run on AMD MI300X hardware; the same Helm charts and ml/ directory work on NVIDIA A100 and H100 clusters.
This means your experimental results are not tied to a single vendor's hardware availability, pricing, or software ecosystem. If you develop on NVIDIA and need to scale on AMD — or the other way around — the platform does not get in the way.
Designed for Experimentation
The training pipeline runs from a local ml/ directory that is packaged and shipped to the cluster automatically with each job submission. This means you can modify the training loop, optimizer, loss function, or data loader locally — with your normal editor and version control — and the cluster picks it up without a Docker rebuild or a redeployment.
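As an illustration of the kind of local change this enables, here is a hypothetical data-loader helper of the sort you might drop into ml/ — greedy sequence packing. The function and its place in the pipeline are assumptions for illustration, not part of ScalarLM's shipped code.

```python
# Hypothetical data-loader customization: greedily pack variable-length
# examples into bins no longer than max_len tokens, reducing padding
# waste in training batches.

def pack_sequences(lengths, max_len):
    """Return lists of example indices, each list fitting in max_len."""
    bins, current, current_len = [], [], 0
    for i, n in enumerate(lengths):
        if current and current_len + n > max_len:
            bins.append(current)
            current, current_len = [], 0
        current.append(i)
        current_len += n
    if current:
        bins.append(current)
    return bins
```

Because the ml/ directory ships with each job submission, a change like this takes effect on the next submitted job, with no image rebuild.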
Current production deployments include Gemma 3 4B Instruct, Gemma 3 Embedding 300M, and Qwen2 32B Instruct, running on multi-GPU Kubernetes clusters with live inference endpoints. Any model available on the Hugging Face Hub can be deployed; see the table below for models that have been validated on ScalarLM.
Supported Models
ScalarLM deploys any Hugging Face-hosted model. The following have been validated and are ready to use by setting the model field in values.yaml:
| Model | Parameters | Architecture | Context | License | Notes |
|---|---|---|---|---|---|
| google/gemma-3-4b-it | 4B | Dense | 128K | Gemma ToU | Default deployment; production-tested |
| google/gemma-3-27b-it | 27B | Dense | 128K | Gemma ToU | |
| Qwen/Qwen2-32B-Instruct | 32B | Dense | 128K | Apache 2.0 | Production-tested |
| Qwen/Qwen3.5-35B-A3B | 35B total / 3B active | Hybrid MoE + Gated Delta Networks | 256K (1M via YaRN) | Apache 2.0 | Native thinking mode; multimodal |
| Qwen/Qwen3.5-122B-A10B | 122B total / 10B active | Hybrid MoE + Gated Delta Networks | 1M | Apache 2.0 | Requires multi-GPU; set inference_gpus accordingly |
| openai/gpt-oss-120b | 117B total / 5.1B active | MoE | 131K | Apache 2.0 | Fits single 80GB GPU; verified on AMD |
| openai/gpt-oss-20b | 21B total / 3.6B active | MoE | 131K | Apache 2.0 | Fits 16GB GPU; strong tool use and reasoning |
| nvidia/Nemotron-3-Super-120B | 120B total / 12B active | Hybrid Mamba-Transformer MoE | 1M | NVIDIA Open | Optimized for multi-agent workloads; NVFP4 recommended |
| EssentialAI/rnj-1-instruct | 8B | Dense (Gemma 3 variant) | 32K | Apache 2.0 | Strong agentic coding and STEM; designed for post-training |
Notes on multi-GPU deployments. Models above 30B active parameters typically require sharding across multiple GPUs. Set inference_gpus in values.yaml and enable sharding in the vLLM Helm chart. See the Kubernetes Deployment page for details.
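A minimal values.yaml excerpt might look like the following. The model and inference_gpus fields are named in the prose above; any other keys and the defaults shown are assumptions, so check your chart's actual schema.

```yaml
# Hypothetical values.yaml excerpt for a sharded deployment
model: Qwen/Qwen2-32B-Instruct   # any validated Hub model from the table
inference_gpus: 4                # shard a >30B-parameter model across four GPUs
```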
Notes on MoE models. Mixture-of-Experts models (Qwen3.5, gpt-oss, Nemotron 3 Super) activate only a fraction of total parameters per token, which means they run significantly faster and cheaper than their total parameter count implies. ScalarLM's vLLM inference backend supports MoE natively on both NVIDIA and AMD hardware.
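To make that cost implication concrete, the active fraction for each MoE entry in the table works out to well under a fifth of total parameters per token:

```python
# Active-parameter fraction per token for the MoE models listed in the
# Supported Models table (total_B, active_B), taken from the table above.
moe_models = {
    "Qwen3.5-35B-A3B": (35.0, 3.0),
    "Qwen3.5-122B-A10B": (122.0, 10.0),
    "gpt-oss-120b": (117.0, 5.1),
    "gpt-oss-20b": (21.0, 3.6),
    "Nemotron-3-Super-120B": (120.0, 12.0),
}
fractions = {name: active / total for name, (total, active) in moe_models.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```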
What It Is Not
ScalarLM is not a managed service or a training-as-a-service product. It is infrastructure you deploy and own. There is no scheduler-as-a-service, no auto-scaling, and no hosted model registry beyond what Hugging Face provides. If you want a fully managed experience, this is probably not the right tool. If you want to understand and control every layer of the stack, it is.
Get Started
| I want to... | Start here |
|---|---|
| Understand the full system design | Architecture |
| Run my first inference or training job | Quick Start |
| Customize the training loop | Custom Training |
| Deploy to my own Kubernetes cluster | Kubernetes Deployment |
| Read the source | GitHub |