Introduction

ScalarLM is an open-source platform for closed-loop LLM experimentation — running a model and post-training it against live feedback in the same deployment, across GPU hardware from any vendor, at scales from a single workstation to multi-node Kubernetes clusters.

It is CC0-licensed. You can fork it, publish with it, build on it, and ship it without restriction or attribution.


Why ScalarLM

Most training and inference infrastructure is designed for one direction of work: either serving a fixed model, or producing a new one. Closing the loop — using a live model's outputs to drive the next round of post-training — typically requires stitching together separate systems with different APIs, checkpoint formats, and scheduling assumptions.

ScalarLM is built around that loop as the primary use case. A single deployment exposes both an inference endpoint and a training endpoint. You can query the running model, construct training signal from the results, and submit a post-training job against the same deployment without touching infrastructure. The updated checkpoint is picked up automatically at the next inference request.
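A minimal sketch of that loop in Python. The OpenAI-compatible inference route is standard, but the training endpoint path and payload shape below are assumptions for illustration, not ScalarLM's documented API; check the Quick Start for the real client interface.

```python
import json

# Hypothetical endpoints -- /v1/chat/completions follows the OpenAI
# convention; the training route and job schema are assumptions.
INFERENCE_URL = "http://localhost:8000/v1/chat/completions"
TRAIN_URL = "http://localhost:8000/v1/train"

def build_training_example(prompt, completion, reward):
    """Turn one scored model output into a supervised training record."""
    return {"input": prompt, "output": completion, "weight": reward}

def build_training_job(scored_outputs, max_steps=100):
    """Package scored live outputs as a post-training job payload."""
    return {
        "dataset": [build_training_example(p, c, r) for p, c, r in scored_outputs],
        "max_steps": max_steps,
    }

# Example: score two live completions and construct the next training job.
scored = [
    ("What is 2+2?", "4", 1.0),
    ("What is 2+2?", "5", 0.0),
]
job = build_training_job(scored)
print(json.dumps(job, indent=2))
# A real loop would POST `job` to TRAIN_URL (e.g. with requests); the
# updated checkpoint is then picked up at the next inference request.
```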

This makes it well-suited for research in online learning, RLHF pipelines, self-improvement, and any setting where the boundary between inference and training needs to be thin.


Architecture

ScalarLM composes three production-grade components, each responsible for a distinct part of the stack:

  • vLLM handles live inference with PagedAttention, continuous batching, and high token throughput. Each deployment exposes an OpenAI-compatible endpoint backed by vLLM.
  • Megatron-LM handles distributed training, providing tensor and pipeline parallelism for scaling across multiple GPUs and nodes. Training jobs are dispatched via a Slurm scheduler running inside the Kubernetes deployment.
  • Hugging Face Hub is the model source and, optionally, the model sink. Any Hub-hosted model can be deployed; post-training checkpoints can be pushed back to the Hub automatically at the end of a training run.

The seam between them is a shared checkpoint system. vLLM loads from a checkpoint; Megatron writes to one. The inference pod does not need to restart for the update to take effect.

A full architecture diagram is on the Architecture page.


GPU Agnostic

ScalarLM runs on NVIDIA and AMD GPUs without code changes. The training stack is built on PyTorch, and the inference stack inherits vLLM's hardware support. Deployments at TensorWave run on AMD MI300X hardware; the same Helm charts and ml/ directory work on NVIDIA A100 and H100 clusters.

This means your experimental results are not tied to a single vendor's hardware availability, pricing, or software ecosystem. If you develop on NVIDIA and need to scale on AMD — or the other way around — the platform does not get in the way.


Designed for Experimentation

The training pipeline runs from a local ml/ directory that is packaged and shipped to the cluster automatically with each job submission. This means you can modify the training loop, optimizer, loss function, or data loader locally — with your normal editor and version control — and the cluster picks it up without a Docker rebuild or a redeployment.
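To make the workflow concrete, here is a toy stand-in for the kind of edit you might make under ml/ -- a per-token-weighted negative log-likelihood, written in plain Python for illustration. The real training loop is PyTorch/Megatron-LM, and this function name and signature are hypothetical, not part of ScalarLM's code.

```python
import math

def weighted_nll(token_probs, weights):
    """Negative log-likelihood with per-token weights (e.g. reward-derived).

    token_probs: model probability assigned to each target token.
    weights: importance weight for each token position.
    """
    assert len(token_probs) == len(weights)
    total = sum(-w * math.log(p) for p, w in zip(token_probs, weights))
    return total / sum(weights)

# Two tokens, equal weight: the loss is the mean of -ln(0.5) and -ln(0.25).
loss = weighted_nll([0.5, 0.25], [1.0, 1.0])
print(round(loss, 4))
```

Because the ml/ directory ships with each job submission, an edit like this takes effect on the next run with no image rebuild.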

Current production deployments include Gemma 3 4B Instruct, Gemma 3 Embedding 300M, and Qwen2 32B Instruct, running on multi-GPU Kubernetes clusters with live inference endpoints. Any model available on the Hugging Face Hub can be deployed; see the table below for models that have been validated on ScalarLM.


Supported Models

ScalarLM deploys any Hugging Face-hosted model. The following have been validated and are ready to use by setting the model field in values.yaml:

| Model | Parameters | Architecture | Context | License | Notes |
|---|---|---|---|---|---|
| google/gemma-3-4b-it | 4B | Dense | 128K | Gemma ToU | Default deployment; production-tested |
| google/gemma-3-27b-it | 27B | Dense | 128K | Gemma ToU | |
| Qwen/Qwen2-32B-Instruct | 32B | Dense | 128K | Apache 2.0 | Production-tested |
| Qwen/Qwen3.5-35B-A3B | 35B total / 3B active | Hybrid MoE + Gated Delta Networks | 256K (1M via YaRN) | Apache 2.0 | Native thinking mode; multimodal |
| Qwen/Qwen3.5-122B-A10B | 122B total / 10B active | Hybrid MoE + Gated Delta Networks | 1M | Apache 2.0 | Requires multi-GPU; set inference_gpus accordingly |
| openai/gpt-oss-120b | 117B total / 5.1B active | MoE | 131K | Apache 2.0 | Fits a single 80GB GPU; verified on AMD |
| openai/gpt-oss-20b | 21B total / 3.6B active | MoE | 131K | Apache 2.0 | Fits a 16GB GPU; strong tool use and reasoning |
| nvidia/Nemotron-3-Super-120B | 120B total / 12B active | Hybrid Mamba-Transformer MoE | 1M | NVIDIA Open | Optimized for multi-agent workloads; NVFP4 recommended |
| EssentialAI/rnj-1-instruct | 8B | Dense (Gemma 3 variant) | 32K | Apache 2.0 | Strong agentic coding and STEM; designed for post-training |

Notes on multi-GPU deployments. Models above 30B active parameters typically require sharding across multiple GPUs. Set inference_gpus in values.yaml and enable sharding in the vLLM Helm chart. See the Kubernetes Deployment page for details.
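A minimal values.yaml fragment combining both settings named above. The model and inference_gpus fields come from this page; any other keys and the exact schema should be confirmed on the Kubernetes Deployment page.

```yaml
# values.yaml -- deploy a validated model from the table above, sharded
# across four GPUs per the multi-GPU note.
model: Qwen/Qwen2-32B-Instruct
inference_gpus: 4
```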

Notes on MoE models. Mixture-of-Experts models (Qwen3.5, gpt-oss, Nemotron 3 Super) activate only a fraction of total parameters per token, which means they run significantly faster and cheaper than their total parameter count implies. ScalarLM's vLLM inference backend supports MoE natively on both NVIDIA and AMD hardware.
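A quick back-of-the-envelope illustration of that point, using the gpt-oss-120b figures from the table above:

```python
# Per-token compute for an MoE model scales with active parameters,
# not total parameters. For gpt-oss-120b:
total_params = 117e9   # all expert weights (memory footprint)
active_params = 5.1e9  # parameters activated per token (compute)

active_fraction = active_params / total_params
print(f"{active_fraction:.1%} of parameters active per token")
# Per-token FLOPs therefore resemble a dense ~5.1B model, while the
# memory footprint still reflects all 117B weights.
```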


What It Is Not

ScalarLM is not a managed service or a training-as-a-service product. It is infrastructure you deploy and own. There is no scheduler-as-a-service, no auto-scaling, and no hosted model registry beyond what Hugging Face provides. If you want a fully managed experience, this is probably not the right tool. If you want to understand and control every layer of the stack, it is.


Get Started

| I want to... | Start here |
|---|---|
| Understand the full system design | Architecture |
| Run my first inference or training job | Quick Start |
| Customize the training loop | Custom Training |
| Deploy to my own Kubernetes cluster | Kubernetes Deployment |
| Read the source | GitHub |