Kubernetes Deployment
ScalarLM can be deployed on Kubernetes using the provided Helm charts. This guide covers cluster access, configuration, install/uninstall, and day-to-day operations including pod inspection and log access.
Prerequisites
- Helm installed locally.
- SSH access to your cluster's control plane node (see below).
- A running Kubernetes cluster. You can set one up yourself, use a managed service like GKE or AKS, or use the provided Ansible playbook to install Kubernetes on an Ubuntu 22.04 VM:
ansible-playbook -i hosts -v deployment/ansible/k8.yml
SSH Access to the Cluster
ScalarLM clusters typically use a two-hop SSH configuration: you first connect to a gateway node, then jump to the Kubernetes control plane.
# Load your private key into the SSH agent
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/{your_private_key}
ssh-add -l
# Connect to the gateway
ssh -A -i ~/.ssh/{your_private_key} tensorwave@{gateway-ip}
# Jump to the control plane
ssh -A mia1-vm-scalarlm-k8s-ctrl-01
The -A flag forwards your SSH agent, which is required for the second hop.
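The two-hop pattern above can also be captured in `~/.ssh/config`, so a single `ssh` command reaches the control plane. A sketch, assuming the host names used in this guide (substitute your actual key path and gateway IP):

```
# ~/.ssh/config — illustrative example
Host scalarlm-gateway
    HostName {gateway-ip}
    User tensorwave
    IdentityFile ~/.ssh/{your_private_key}
    ForwardAgent yes

Host mia1-vm-scalarlm-k8s-ctrl-01
    ProxyJump scalarlm-gateway
    ForwardAgent yes
```

With this in place, `ssh mia1-vm-scalarlm-k8s-ctrl-01` performs both hops automatically.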
Repository Structure
Clone the ScalarLM repository and navigate to the Helm charts directory:
git clone https://github.com/tensorwavecloud/ScalarLM.git
cd ScalarLM/deployment/helm
Each model deployment has its own subdirectory (e.g. gemma3_4b_it/, gemma3_embedding_300m/). Inside each, the scalarlm/templates/ directory contains the following Helm templates:
| File | Purpose |
|---|---|
| api_deployment.yaml | The main ScalarLM API server |
| megatron_deployment.yaml | Megatron-LM training pods (StatefulSet) |
| vllm_deployment.yaml | vLLM inference pod |
| cloudflare_deployment.yaml | Cloudflare tunnel for external access |
| api_configmap.yaml | Environment config for the API pod |
| megatron_configmap.yaml | Environment config for the Megatron pods |
| vllm_configmap.yaml | Environment config for the vLLM pod |
| api_service.yaml | ClusterIP service for the API |
| megratron_service.yaml | ClusterIP service for the Megatron pods |
| vllm_service.yaml | ClusterIP service for vLLM |
| cache_pvc.yaml | Persistent volume for model cache |
| jobs_pvc.yaml | Persistent volume for training jobs |
| slurm_config_pvc.yaml | Persistent volume for Slurm config |
Configuring values.yaml
All deployment parameters are controlled via values.yaml. Edit this file before installing or upgrading a deployment:
vim gemma3_4b_it/scalarlm/values.yaml
Full values.yaml Reference
image:
  repository: farbodatdocker/scalarlm
  tag: v1.4
  pullPolicy: Always
service:
  type: ClusterIP
  api_port: 8000
  vllm_port: 8001
externalIP: 64.139.222.102     # Public IP of the node (not needed if using Cloudflare)
jobs_pvc:
  storageClass: longhorn
  size: 100Gi                  # Storage for training job artifacts and checkpoints
cache_pvc:
  storageClass: longhorn
  size: 400Gi                  # Storage for downloaded model weights
slurm_config_pvc:
  storageClass: longhorn
  size: 10Gi
cloudflared:
  tunnelToken: eyJ...          # Cloudflare tunnel token — see "Cloudflare" section below
model: google/gemma-3-4b-it    # Any HuggingFace model ID
max_model_length: 4096         # Maximum context window (tokens)
gpu_memory_utilization: 0.95   # Fraction of GPU VRAM to allocate to vLLM
dtype: bfloat16                # Model dtype: bfloat16, float16, or float32
training_gpus: 8               # Number of GPUs to use for Megatron training
inference_gpus: 1              # Number of GPUs to use for vLLM inference
max_train_time: 86400          # Maximum training job duration in seconds (default: 24h)
Key relationships:
- `training_gpus` controls how many `scalarlm-megatron-N` pods are created.
- `inference_gpus` is set at the deployment level in `vllm_deployment.yaml`. For models too large to fit on a single GPU, see Sharding for Inference.
- If using Cloudflare, `externalIP` and `hostPort` in the Helm chart are not required.
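The first of these relationships can be sketched as a quick shell check: given a `training_gpus` value, these are the Megatron pod names you should expect to see (pod naming taken from the deployment listing below; the loop itself is purely illustrative):

```shell
# Illustrative only: print the expected Megatron StatefulSet pod names
# for a given training_gpus value (pods are numbered from 0).
training_gpus=4
for i in $(seq 0 $((training_gpus - 1))); do
  echo "scalarlm-megatron-$i"
done
```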
Installing a Deployment
Navigate into the specific model's deployment directory and run Helm install:
cd ScalarLM/deployment/helm/gemma3_4b_it
helm -n gemma3-4b-it install scalarlm scalarlm
Monitor the rollout until all pods reach Running status:
watch kubectl -n gemma3-4b-it get pods
A healthy deployment looks like:
NAME READY STATUS RESTARTS AGE
scalarlm-6675679b96-4vrwl 1/1 Running 0 29s
scalarlm-cloudflared-644f75496b-mhgmm 1/1 Running 0 29s
scalarlm-megatron-0 1/1 Running 0 29s
scalarlm-megatron-1 1/1 Running 0 15s
scalarlm-megatron-2 1/1 Running 0 13s
scalarlm-megatron-3 1/1 Running 0 7s
scalarlm-vllm-6b78fcdbf-4s44w 1/1 Running 0 29s
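While watching the rollout, it can help to count pods that have not yet reached Running. A small sketch using `awk` on `kubectl get pods` output; sample output is inlined here so the pipeline is visible, but in practice you would pipe the real command:

```shell
# Count pods whose STATUS column is not "Running", skipping the header.
# Sample output is inlined; replace the echo with:
#   kubectl -n gemma3-4b-it get pods
pods='NAME                                    READY   STATUS    RESTARTS   AGE
scalarlm-6675679b96-4vrwl               1/1     Running   0          29s
scalarlm-megatron-0                     0/1     Pending   0          12s
scalarlm-vllm-6b78fcdbf-4s44w           1/1     Running   0          29s'
echo "$pods" | awk 'NR > 1 && $3 != "Running" { n++ } END { print n+0 }'
```

Here the count is 1 (the Pending Megatron pod); a count of 0 means the rollout is complete.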
Uninstalling a Deployment
cd ScalarLM/deployment/helm/gemma3_4b_it
helm -n gemma3-4b-it uninstall scalarlm
Note that the three persistent volume claims (scalarlm-cache, scalarlm-jobs, scalarlm-slurm-config) are intentionally retained by default so that model weights and job artifacts are not lost. Delete them manually only if you want a clean slate:
kubectl -n gemma3-4b-it delete pvc scalarlm-cache scalarlm-jobs scalarlm-slurm-config
Monitor until all pods are gone:
watch kubectl -n gemma3-4b-it get pods
# Expected: "No resources found in gemma3-4b-it namespace."
Inspecting a Running Deployment
List all namespaces
kubectl get namespace
Each deployed model runs in its own namespace (e.g. gemma3-4b-it, gemma3-embedding-300m, qwen2-32b-it).
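An observation from the examples in this guide (not a documented guarantee): namespace names appear to be the Helm chart directory name with underscores replaced by hyphens, which `tr` makes explicit:

```shell
# gemma3_4b_it (Helm chart directory) -> gemma3-4b-it (namespace)
echo "gemma3_4b_it" | tr '_' '-'
```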
List pods in a namespace
kubectl -n gemma3-4b-it get pods
View logs for a pod
# Stream logs from the API or vLLM pod
kubectl -n gemma3-4b-it logs -f scalarlm-vllm-6b78fcdbf-ql4rh
# Stream logs from a Megatron training pod
kubectl -n gemma3-4b-it logs -f scalarlm-megatron-0
Shell into a pod
kubectl -n gemma3-4b-it exec -it scalarlm-megatron-0 -- bash
Once inside, you can inspect the Slurm state and job directory:
# Check which nodes are visible to Slurm
sinfo
# Check the job queue
squeue
# Browse completed and active job artifacts
ls /app/cray/jobs/
Each job directory contains the checkpoint, config, dataset, logs, and a status.json:
checkpoint_16.pt
config.yaml
dataset.jsonlines
ml/
slurm-9.out
status.json
train_job_entrypoint.sh
Cloudflare Tunnel
ScalarLM supports Cloudflare Tunnels for exposing the API externally without opening ports or configuring a load balancer. This is the recommended approach for production deployments.
To enable it, set cloudflared.tunnelToken in values.yaml to your Cloudflare tunnel token. The scalarlm-cloudflared pod will automatically establish a secure outbound tunnel.
Note: When using Cloudflare, you do not need to configure `hostPort` in the Helm chart or set `externalIP` in `values.yaml`.
Sharding for Inference
For models that are too large to fit on a single GPU, vLLM supports tensor parallelism (sharding). To enable it, edit vllm_deployment.yaml and add the --tensor-parallel-size flag to the vLLM startup command, matching the number of GPUs you want to shard across:
args:
- "--tensor-parallel-size"
- "4"
Also update inference_gpus in values.yaml to match.
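Because the flag and the values file are edited separately, it is easy for them to drift apart. An illustrative guard, with both values inlined here (in a real setup you would read them from `vllm_deployment.yaml` and `values.yaml`):

```shell
# Illustrative consistency check: --tensor-parallel-size should match
# inference_gpus, so that vLLM requests the intended number of GPUs.
inference_gpus=4
tensor_parallel_size=4
if [ "$inference_gpus" -eq "$tensor_parallel_size" ]; then
  echo "ok: tensor parallel size matches inference_gpus"
else
  echo "mismatch: inference_gpus=$inference_gpus, tp=$tensor_parallel_size" >&2
fi
```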
Note: When loading a model for inference across multiple megatron pods, it is generally better to load the model separately on each pod rather than sharing a single load, unless the model is too large to fit on a single GPU.
Troubleshooting
Deployment stops responding after a while
This can happen when the status.json file in a job directory is not updating correctly, causing the Megatron pod to stall. To recover:
Check the pod logs for errors:
kubectl -n gemma3-4b-it logs -f scalarlm-megatron-0
Shell into the pod and inspect the job's status.json:
kubectl -n gemma3-4b-it exec -it scalarlm-megatron-0 -- bash
cat /app/cray/jobs/{job-id}/status.json
After resolving the underlying issue, manually re-register the pod:
./start_slurm.sh
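To spot a stalled job without reading every file by hand, one heuristic (an assumption of this guide, not a ScalarLM feature) is to look for status.json files that have not been modified recently:

```shell
# Heuristic sketch: list job directories whose status.json has not been
# touched in over 60 minutes. Run inside a Megatron pod; the default
# path matches the job directory layout shown earlier.
stalled_jobs() {
  jobs_dir="${1:-/app/cray/jobs}"
  find "$jobs_dir" -name status.json -mmin +60 2>/dev/null |
    while read -r f; do
      dirname "$f"
    done
}
stalled_jobs
```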
A pod fails to start
Check the pod's events and logs:
kubectl -n gemma3-4b-it describe pod scalarlm-megatron-0
kubectl -n gemma3-4b-it logs scalarlm-megatron-0
If the pod failed due to a transient error, re-register it with ./start_slurm.sh after resolving the issue.
Pods can't see each other / Slurm shows nodes as down
Shell into a Megatron pod and run `sinfo`. All `scalarlm-megatron-N` nodes should appear as `idle` or `alloc`. If nodes are missing or in a `down` state, check the network configuration and ensure the Megatron StatefulSet pods have stable DNS entries (they communicate via the Service defined in `megratron_service.yaml`).