

Kubernetes Deployment

ScalarLM can be deployed on Kubernetes using the provided Helm charts. This guide covers cluster access, configuration, install/uninstall, and day-to-day operations including pod inspection and log access.


Prerequisites

  • A running Kubernetes cluster (see below).
  • Helm installed locally.
  • SSH access to your cluster's control plane node (see below).

You can set up a cluster yourself or use a managed service such as GKE or AKS. Alternatively, use the provided Ansible playbook to install Kubernetes on an Ubuntu 22.04 VM:

ansible-playbook -i hosts -v deployment/ansible/k8.yml

SSH Access to the Cluster

ScalarLM clusters typically use a two-hop SSH configuration: you first connect to a gateway node, then jump to the Kubernetes control plane.

# Load your private key into the SSH agent
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/{your_private_key}
ssh-add -l

# Connect to the gateway
ssh -A -i ~/.ssh/{your_private_key} tensorwave@{gateway-ip}

# Jump to the control plane
ssh -A mia1-vm-scalarlm-k8s-ctrl-01

The -A flag forwards your SSH agent, which is required for the second hop.
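
The two hops can also be encoded in ~/.ssh/config so that a single ssh command reaches the control plane. A sketch, using the hosts from the commands above (the Host aliases here are illustrative, not required names):

```
Host scalarlm-gateway
    HostName {gateway-ip}
    User tensorwave
    IdentityFile ~/.ssh/{your_private_key}
    ForwardAgent yes

Host scalarlm-ctrl
    HostName mia1-vm-scalarlm-k8s-ctrl-01
    ProxyJump scalarlm-gateway
    ForwardAgent yes
```

With this in place, `ssh scalarlm-ctrl` performs both hops in one step.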


Repository Structure

Clone the ScalarLM repository and navigate to the Helm charts directory:

git clone https://github.com/tensorwavecloud/ScalarLM.git
cd ScalarLM/deployment/helm

Each model deployment has its own subdirectory (e.g. gemma3_4b_it/, gemma3_embedding_300m/). Inside each, the scalarlm/templates/ directory contains the following Helm templates:

File                          Purpose
api_deployment.yaml           The main ScalarLM API server
megatron_deployment.yaml      Megatron-LM training pods (StatefulSet)
vllm_deployment.yaml          vLLM inference pod
cloudflare_deployment.yaml    Cloudflare tunnel for external access
api_configmap.yaml            Environment config for the API pod
megatron_configmap.yaml       Environment config for the Megatron pods
vllm_configmap.yaml           Environment config for the vLLM pod
api_service.yaml              ClusterIP service for the API
megratron_service.yaml        ClusterIP service for the Megatron pods
vllm_service.yaml             ClusterIP service for vLLM
cache_pvc.yaml                Persistent volume claim for the model cache
jobs_pvc.yaml                 Persistent volume claim for training jobs
slurm_config_pvc.yaml         Persistent volume claim for the Slurm config

Configuring values.yaml

All deployment parameters are controlled via values.yaml. Edit this file before installing or upgrading a deployment:

vim gemma3_4b_it/scalarlm/values.yaml

Full values.yaml Reference

image:
  repository: farbodatdocker/scalarlm
  tag: v1.4
  pullPolicy: Always

service:
  type: ClusterIP
  api_port: 8000
  vllm_port: 8001
  externalIP: 64.139.222.102   # Public IP of the node (not needed if using Cloudflare)

jobs_pvc:
  storageClass: longhorn
  size: 100Gi                  # Storage for training job artifacts and checkpoints

cache_pvc:
  storageClass: longhorn
  size: 400Gi                  # Storage for downloaded model weights

slurm_config_pvc:
  storageClass: longhorn
  size: 10Gi

cloudflared:
  tunnelToken: eyJ...          # Cloudflare tunnel token — see "Cloudflare" section below

model: google/gemma-3-4b-it    # Any HuggingFace model ID
max_model_length: 4096         # Maximum context window (tokens)
gpu_memory_utilization: 0.95   # Fraction of GPU VRAM to allocate to vLLM
dtype: bfloat16                # Model dtype: bfloat16, float16, or float32

training_gpus: 8               # Number of GPUs to use for Megatron training
inference_gpus: 1              # Number of GPUs to use for vLLM inference

max_train_time: 86400          # Maximum training job duration in seconds (default: 24h)

Key relationships:

  • training_gpus controls how many scalarlm-megatron-N pods are created.
  • inference_gpus sets the GPU count requested by the vLLM pod in vllm_deployment.yaml. For models too large to fit on a single GPU, see Sharding for Inference.
  • If using Cloudflare, externalIP and hostPort in the Helm chart are not required.
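
Because training and inference pods draw on the same cluster, it must offer at least the sum of both GPU counts. A quick sanity check, using the example values above:

```shell
# values.yaml: training_gpus and inference_gpus are allocated to separate pods,
# so the cluster needs at least their sum in free GPUs.
TRAINING_GPUS=8
INFERENCE_GPUS=1
TOTAL=$((TRAINING_GPUS + INFERENCE_GPUS))
echo "GPUs required: $TOTAL"
```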

Installing a Deployment

Navigate into the specific model's deployment directory and run Helm install:

cd ScalarLM/deployment/helm/gemma3_4b_it

helm -n gemma3-4b-it install scalarlm scalarlm

Monitor the rollout until all pods reach Running status:

watch kubectl -n gemma3-4b-it get pods

A healthy deployment looks like:

NAME                                    READY   STATUS    RESTARTS   AGE
scalarlm-6675679b96-4vrwl               1/1     Running   0          29s
scalarlm-cloudflared-644f75496b-mhgmm   1/1     Running   0          29s
scalarlm-megatron-0                     1/1     Running   0          29s
scalarlm-megatron-1                     1/1     Running   0          15s
scalarlm-megatron-2                     1/1     Running   0          13s
scalarlm-megatron-3                     1/1     Running   0          7s
scalarlm-vllm-6b78fcdbf-4s44w           1/1     Running   0          29s

Uninstalling a Deployment

cd ScalarLM/deployment/helm/gemma3_4b_it

helm -n gemma3-4b-it uninstall scalarlm

Note that the three persistent volume claims (scalarlm-cache, scalarlm-jobs, scalarlm-slurm-config) are intentionally retained by default so that model weights and job artifacts are not lost. Delete them manually only if you want a clean slate:

kubectl -n gemma3-4b-it delete pvc scalarlm-cache scalarlm-jobs scalarlm-slurm-config

Monitor until all pods are gone:

watch kubectl -n gemma3-4b-it get pods
# Expected: "No resources found in gemma3-4b-it namespace."

Inspecting a Running Deployment

List all namespaces

kubectl get namespace

Each deployed model runs in its own namespace (e.g. gemma3-4b-it, gemma3-embedding-300m, qwen2-32b-it).
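
The examples in this guide follow a naming convention: the Helm chart directory uses underscores (gemma3_4b_it/) while the corresponding namespace uses hyphens (gemma3-4b-it). Assuming that convention holds, the namespace can be derived from the directory name:

```shell
# Derive the namespace from the chart directory name by replacing
# underscores with hyphens (convention observed in this guide).
MODEL_DIR="gemma3_4b_it"
NAMESPACE="${MODEL_DIR//_/-}"
echo "$NAMESPACE"   # gemma3-4b-it
```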

List pods in a namespace

kubectl -n gemma3-4b-it get pods

View logs for a pod

# Stream logs from the API or vLLM pod
kubectl -n gemma3-4b-it logs -f scalarlm-vllm-6b78fcdbf-ql4rh

# Stream logs from a Megatron training pod
kubectl -n gemma3-4b-it logs -f scalarlm-megatron-0

Shell into a pod

kubectl -n gemma3-4b-it exec -it scalarlm-megatron-0 -- bash

Once inside, you can inspect the Slurm state and job directory:

# Check which nodes are visible to Slurm
sinfo

# Check the job queue
squeue

# Browse completed and active job artifacts
ls /app/cray/jobs/

Each job directory contains the checkpoint, config, dataset, logs, and a status.json:

checkpoint_16.pt
config.yaml
dataset.jsonlines
ml/
slurm-9.out
status.json
train_job_entrypoint.sh
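
Inside the pod, `python3 -m json.tool` is a convenient way to pretty-print a job's status.json. A local illustration with a hypothetical file (the real schema is whatever the training job writes):

```shell
# Create a stand-in status file; on a pod you would instead point json.tool at
# /app/cray/jobs/{job-id}/status.json
printf '{"status": "COMPLETED", "job_id": 9}' > /tmp/status.json

# Pretty-print the JSON for readability
python3 -m json.tool /tmp/status.json
```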

Cloudflare Tunnel

ScalarLM supports Cloudflare Tunnels for exposing the API externally without opening ports or configuring a load balancer. This is the recommended approach for production deployments.

To enable it, set cloudflared.tunnelToken in values.yaml to your Cloudflare tunnel token. The scalarlm-cloudflared pod will automatically establish a secure outbound tunnel.

Note: When using Cloudflare, you do not need to configure hostPort in the Helm chart or set externalIP in values.yaml.

Sharding for Inference

For models that are too large to fit on a single GPU, vLLM supports tensor parallelism (sharding). To enable it, edit vllm_deployment.yaml and add the --tensor-parallel-size flag to the vLLM startup command, matching the number of GPUs you want to shard across:

args:
  - "--tensor-parallel-size"
  - "4"

Also update inference_gpus in values.yaml so the pod's GPU allocation matches the tensor-parallel size.
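
For example, a 4-way sharded deployment keeps the two settings in sync (a sketch, using the flag and key named above):

```yaml
# values.yaml — must equal --tensor-parallel-size in vllm_deployment.yaml
inference_gpus: 4

# vllm_deployment.yaml — container args (excerpt)
# args:
#   - "--tensor-parallel-size"
#   - "4"
```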

Note: When running inference across multiple pods, it is generally better to load a full copy of the model on each pod rather than sharding one copy across them, unless the model is too large to fit on a single GPU.

Troubleshooting

Deployment stops responding after a while

This can happen when the status.json file in a job directory stops updating, causing the Megatron pod to stall. To recover:

Check the pod logs for errors:

kubectl -n gemma3-4b-it logs -f scalarlm-megatron-0

Shell into the pod and inspect the job's status.json:

kubectl -n gemma3-4b-it exec -it scalarlm-megatron-0 -- bash
cat /app/cray/jobs/{job-id}/status.json

After resolving the underlying issue, manually re-register the pod:

./start_slurm.sh

A pod fails to start

Check the pod's events and logs:

kubectl -n gemma3-4b-it describe pod scalarlm-megatron-0
kubectl -n gemma3-4b-it logs scalarlm-megatron-0

If the pod failed due to a transient error, re-register it with ./start_slurm.sh after resolving the issue.

Pods can't see each other / Slurm shows nodes as down

Shell into a Megatron pod and run sinfo. All scalarlm-megatron-N nodes should appear as idle or alloc. If nodes are missing or in a down state, check the network configuration and ensure the Megatron StatefulSet pods have stable DNS entries (they communicate via the megratron_service).
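
To spot problem nodes quickly, `sinfo -h -o "%n %t"` prints one `hostname state` pair per line, which is easy to filter. A sketch using hypothetical output (on a real pod, pipe sinfo itself into the awk filter):

```shell
# Hypothetical output of: sinfo -h -o "%n %t"
sinfo_output="scalarlm-megatron-0 idle
scalarlm-megatron-1 alloc
scalarlm-megatron-2 down"

# Flag any node that is neither idle nor allocated
bad_nodes=$(echo "$sinfo_output" | awk '$2 != "idle" && $2 != "alloc" {print $1}')
echo "$bad_nodes"
```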