Frequently Asked Questions


Getting Started

How do I install the ScalarLM client?

Install directly from PyPI:

pip install scalarlm

Then point the client at your deployment:

export SCALARLM_API_URL="https://gemma3_4b_it.farbodopensource.org"

The full CLI reference is available at CLI.

Do I need special prompt templates for training or inference?

Yes. Training and inference prompts must follow the chat template published on Hugging Face for the model your deployment is running. Each ScalarLM deployment is tied to a single base model, so refer to that model's Hugging Face card for the correct format. For example, a Llama 3.1 8B Instruct deployment uses the Llama 3.1 chat template.
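To make the format concrete, here is a minimal sketch that renders a message list into the Llama 3.1 Instruct layout by hand (the special-token layout follows the Llama 3.1 model card; in practice, prefer `tokenizer.apply_chat_template` from the transformers library, which reads the template straight from the model repo):

```python
def format_llama31_prompt(messages):
    """Render a message list into the Llama 3.1 Instruct chat format.

    The header/eot token layout follows the Llama 3.1 model card. Other
    base models use different templates, so always check the Hugging Face
    card for the model your deployment runs.
    """
    prompt = "<|begin_of_text|>"
    for msg in messages:
        prompt += f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
        prompt += f"{msg['content']}<|eot_id|>"
    # Cue the model to generate the assistant turn next
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

prompt = format_llama31_prompt([
    {"role": "user", "content": "What is ScalarLM?"},
])
```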

Can I use any Hugging Face model?

Any model supported by vLLM can in principle be deployed. In practice, deploying a new model means:

  1. Setting the model field in values.yaml to the HuggingFace model ID (e.g. google/gemma-3-4b-it)
  2. Adjusting max_model_length, dtype, and gpu_memory_utilization for the model's requirements
  3. Running helm -n <namespace> uninstall scalarlm followed by helm -n <namespace> install scalarlm scalarlm

See the Kubernetes deployment guide for the full values reference.


Training & Fine-Tuning

What training parameters can I pass via train_args?

The following parameters are supported directly in train_args without modifying any source code:

Parameter             Type    Description
--------------------  ------  ------------------------------------------------
max_steps             int     Total number of training steps
learning_rate         float   Optimizer learning rate
gpus                  int     Number of GPUs to request for the job
dtype                 string  Model dtype, e.g. "bfloat16", "float32"
max_token_block_size  int     Token chunk size (replaces batch size; see below)

Do not use max_gpus; it is a debugging parameter only.

For anything beyond this — custom loss functions, optimizer swaps, dataset preprocessing — you'll need to modify the ml/ directory directly. See Custom Training.

What is the relationship between training steps and epochs in ScalarLM?

ScalarLM moves away from conventional batch-based training. Instead of batching examples, it flattens your entire dataset into a single continuous token stream and splits it into fixed-length chunks controlled by max_token_block_size.

  • Training step: one forward-and-backward pass over one chunk on one GPU
  • Epoch: a complete pass through all chunks across all GPUs

steps_per_epoch = total_chunks ÷ num_GPUs

There is no batch_size parameter. For multi-GPU jobs, chunks are partitioned into shards — one per GPU — and each GPU iterates through its shard sequentially. This guarantees uniform token throughput and balanced GPU workloads. The chunking source code is here.
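The flatten-chunk-shard scheme described above can be sketched in a few lines. This is a simplified illustration, not the actual ScalarLM code, and the round-robin shard assignment is an assumption; the real partitioning may differ:

```python
def chunk_and_shard(token_stream, max_token_block_size, num_gpus):
    """Split one continuous token stream into fixed-length chunks,
    then deal the chunks round-robin into one shard per GPU."""
    chunks = [
        token_stream[i:i + max_token_block_size]
        for i in range(0, len(token_stream), max_token_block_size)
    ]
    shards = [chunks[g::num_gpus] for g in range(num_gpus)]
    return chunks, shards

# 10,000 tokens, 1024-token chunks, 2 GPUs
tokens = list(range(10_000))
chunks, shards = chunk_and_shard(tokens, 1024, 2)

# steps_per_epoch = total_chunks ÷ num_GPUs
steps_per_epoch = len(chunks) // 2
```

Each GPU then iterates through its own shard sequentially, which is what keeps per-GPU token throughput uniform.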

How do I change the token chunk (block) size?

Pass max_token_block_size in train_args:

llm.train(
    dataset,
    train_args={
        "max_steps": 200,
        "learning_rate": 3e-4,
        "max_token_block_size": 1024
    }
)

Can I change the dtype for training?

Yes, via train_args:

llm.train(
    dataset,
    train_args={
        "max_steps": 200,
        "learning_rate": 3e-3,
        "gpus": 1,
        "dtype": "float32"
    }
)

At inference time, vLLM uses the dtype defined in the deployment configuration. If the checkpoint dtype differs, vLLM converts automatically when loading.

What happens if I launch multiple fine-tuning jobs at once?

Jobs are queued automatically by the built-in Slurm scheduler. You can inspect the queue at any time:

scalarlm squeue

Or, if you're logged into the cluster directly:

kubectl -n <namespace> exec -it scalarlm-megatron-0 -- bash
squeue

How do I run inference from the base model without using a fine-tuned checkpoint?

By default, ScalarLM loads the fine-tuned model name from the job config. To run inference directly from the base model, open ml/cray_megatron/megatron/training_loop.py and comment out the line that overrides the model name:

# model_name = config.model_name   # comment this out to use the base model

How do I change the loss function or optimizer?

Both are configured in ml/cray_megatron/megatron/training_loop.py. Place a customized copy of the ml/ directory alongside your training script and it will be uploaded automatically with your next job submission — no Docker rebuild required.

Switching from Adam to Adafactor:

# Replace the Adam optimizer block with:
from transformers.optimization import Adafactor

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,
)

Swapping the loss function: the loss is computed around line 105 of training_loop.py. Replace the default cross-entropy call with any PyTorch-compatible loss.

For embedding model training, the relevant files are:

  • ml/cray_megatron/models/tokenformer/load_tokenformer_model.py — model architecture
  • ml/cray_megatron/megatron/dataset/load_dataset.py — dataset loading and chunking
  • ml/cray_megatron/megatron/training_loop.py — training loop and loss

Can I implement RLHF?

Yes. Use the ScalarLM inference endpoint to score or rank outputs with your reward model, then feed the selected data back into the training endpoint to update the model. Because the training and inference APIs are both OpenAI-compatible, the reward model can be hosted on a separate ScalarLM deployment.
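The selection step of that loop is easiest to see with a stub in place of the reward model. A sketch of best-of-n selection (`score` here is a hypothetical stand-in; in a real setup it would call the reward model's ScalarLM inference endpoint):

```python
def select_best(prompt, candidates, score):
    """Best-of-n: score each candidate completion and keep the winner.
    The (prompt, best) pair can then be fed back as training data."""
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score

# Stub reward: prefer shorter answers (stand-in for a real reward model)
score = lambda prompt, completion: -len(completion)

best, s = select_best(
    "Summarize ScalarLM in one line.",
    ["A long rambling answer covering many unrelated things...",
     "Unified training + inference."],
    score,
)
```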

Is early stopping available?

The framework doesn't expose early stopping parameters natively, but because the training loop is built on PyTorch and HuggingFace, you can integrate the HuggingFace EarlyStoppingCallback into training_loop.py directly.
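If you'd rather not pull in the HuggingFace callback, the check itself is only a few lines you can drop into training_loop.py. A sketch (the class and variable names are illustrative, not part of ScalarLM):

```python
class EarlyStopper:
    """Stop when eval loss hasn't improved by min_delta for `patience` evals."""
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, eval_loss):
        if eval_loss < self.best - self.min_delta:
            self.best = eval_loss      # new best: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1        # no improvement this eval
        return self.bad_evals >= self.patience

stopper = EarlyStopper(patience=2)
losses = [1.0, 0.8, 0.81, 0.82, 0.79]
stopped_at = next((i for i, l in enumerate(losses) if stopper.should_stop(l)), None)
```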

How do I save my fine-tuned model to Hugging Face?

See the dedicated guide: Save Fine-Tuned Model to Hugging Face. It covers pushing raw .pt checkpoints as well as publishing a fully from_pretrained-compatible model repository.


Inference

How do I run inference from an embedding model?

vLLM supports only a limited set of embedding models. For local embedding inference, use the sentence-transformers library instead:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("your-username/my-fine-tuned-embedding-model")
embeddings = model.encode(["Hello world", "ScalarLM is fast"])
print(embeddings.shape)

Install it with:

pip install sentence-transformers

For a full list of models supported natively by vLLM for embedding, see the vLLM supported models list.

How do I enable inference with tensor parallelism (sharding)?

For models too large to fit on a single GPU, vLLM supports tensor parallelism. Enable it by adding the --tensor-parallel-size flag to the vLLM startup args in vllm_deployment.yaml inside your Helm chart templates:

args:
  - "--tensor-parallel-size"
  - "2"   # set to the number of GPUs you want vLLM to shard across

Make sure inference_gpus in values.yaml matches or exceeds this value, then redeploy.

Should I load the model on all pods or just one?

It's generally better to load the model separately on each pod. This gives you independent inference capacity per pod and avoids a single point of failure. The exception is a model so large it cannot fit in a single GPU's memory — in that case use tensor parallelism across GPUs within a single vLLM instance (see above) rather than splitting across pods.

How do I set inference temperature?

Temperature is a parameter passed directly to vLLM at request time via the OpenAI-compatible API:

import openai

client = openai.OpenAI(base_url="https://gemma3_4b_it.farbodopensource.org/v1", api_key="none")

response = client.chat.completions.create(
    model="google/gemma-3-4b-it",
    messages=[{"role": "user", "content": "Write a short poem."}],
    temperature=0.8,
)

Lower values (e.g. 0.2) produce more deterministic outputs; higher values (e.g. 0.9) produce more varied outputs. See the vLLM quickstart for the full list of sampling parameters.
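The mechanism behind this is simple: temperature divides the logits before the softmax. A pure-Python illustration of why low temperature sharpens the distribution and high temperature flattens it:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax. As T -> 0 the result
    approaches argmax; large T flattens it toward uniform."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.2)   # near-deterministic
hot = softmax_with_temperature(logits, 2.0)    # more varied
```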

Does ScalarLM cache inference results?

No. Inference is fast enough that response caching is not provided at the platform level. If your application needs caching, implement it client-side.
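A minimal client-side cache keys the stored response on the full request parameters. A sketch (`call_model` is a hypothetical stand-in; wire it to your actual client, and note that caching only makes sense for deterministic sampling, e.g. temperature 0):

```python
import hashlib
import json

class ResponseCache:
    """Memoize completions by (model, messages, sampling params)."""
    def __init__(self):
        self._store = {}

    def _key(self, **request):
        # Canonical JSON so equivalent requests hash identically
        blob = json.dumps(request, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def get_or_call(self, call_model, **request):
        key = self._key(**request)
        if key not in self._store:
            self._store[key] = call_model(**request)
        return self._store[key]

calls = []
def call_model(**request):                 # stand-in for the real API call
    calls.append(request)
    return "pong"

cache = ResponseCache()
req = dict(model="m", messages=[{"role": "user", "content": "ping"}], temperature=0.0)
r1 = cache.get_or_call(call_model, **req)  # misses: performs the call
r2 = cache.get_or_call(call_model, **req)  # hits: returns stored response
```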


Monitoring & Logs

How do I monitor fine-tuning progress and loss curves?

Use the scalarlm plot CLI command:

scalarlm plot

This connects to your deployment and renders a live loss curve. Make sure SCALARLM_API_URL is set before running.

How do I check logs for a running job?

First, find the relevant namespace and pods:

kubectl get namespace
kubectl -n <namespace> get pods

Then tail the logs for the pod you want to inspect:

# API / main pod
kubectl -n <namespace> logs -f <scalarlm-pod-name>

# vLLM inference pod
kubectl -n <namespace> logs -f <scalarlm-vllm-pod-name>

# Megatron training pod
kubectl -n <namespace> logs -f scalarlm-megatron-0

To get a shell inside a pod and inspect the job directory directly:

kubectl -n <namespace> exec -it scalarlm-megatron-0 -- bash

# Inside the pod:
cd /app/cray/jobs/<job-id>/
ls -a   # checkpoint files, config.yaml, dataset, slurm output, status.json
cat slurm-9.out   # training output for job 9

Deployment & Operations

Do I need to set hostPort in the Helm chart when using Cloudflare?

No. When routing traffic through a Cloudflare tunnel, the hostPort field in the Helm chart is not needed. Traffic is handled by the cloudflared sidecar using the tunnelToken set in values.yaml.

Why did my deployment stop working after running for a while?

This is most often caused by the job status JSON file failing to update correctly, which leaves the scheduler in a bad state. To recover:

  1. Check the megatron pod logs for errors: kubectl -n <namespace> logs -f scalarlm-megatron-0
  2. Fix the underlying issue (disk space, permissions, crashed process)
  3. If the pod is stuck and not self-recovering, manually re-register it:

kubectl -n <namespace> exec -it scalarlm-megatron-0 -- bash
./start_slurm.sh

When should I manually run ./start_slurm.sh?

Only when a pod fails to start due to an error and does not recover on its own. Normal deployments register pods automatically. Manual registration is a recovery step — resolve the root cause first, then run ./start_slurm.sh to bring the pod back into the Slurm cluster.

How do I modify deployment settings (model, GPU count, context length)?

Edit values.yaml in your Helm chart directory:

vim /tensorwave/farbod/ScalarLM/deployment/helm/gemma3_4b_it/scalarlm/values.yaml

Key fields:

Field                    Description
-----------------------  ---------------------------------------------------
model                    HuggingFace model ID
training_gpus            GPUs allocated to the Megatron training pods
inference_gpus           GPUs allocated to the vLLM pod
max_model_length         Maximum context length (tokens)
gpu_memory_utilization   Fraction of GPU memory vLLM may use (0.0–1.0)
dtype                    Model dtype: bfloat16, float16, float32
max_train_time           Maximum training job duration in seconds
cloudflared.tunnelToken  Cloudflare tunnel token for public endpoint routing

After editing, redeploy:

helm -n <namespace> uninstall scalarlm
helm -n <namespace> install scalarlm scalarlm

Monitor the rollout with:

watch kubectl -n <namespace> get pods

How do I uninstall and reinstall a deployment?

# Uninstall (PersistentVolumeClaims are retained by default)
helm -n <namespace> uninstall scalarlm

# Reinstall
helm -n <namespace> install scalarlm scalarlm

Note that scalarlm-cache, scalarlm-jobs, and scalarlm-slurm-config PVCs are kept across uninstalls. This means your job history and cached model weights are preserved. To wipe them, delete the PVCs explicitly before reinstalling.
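A full wipe would look something like this (destructive: it permanently removes job history and cached model weights, so double-check the namespace first):

```shell
# Uninstall, then delete the retained PVCs before reinstalling
helm -n <namespace> uninstall scalarlm
kubectl -n <namespace> delete pvc scalarlm-cache scalarlm-jobs scalarlm-slurm-config
helm -n <namespace> install scalarlm scalarlm
```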


Advanced Customization

Where is the full list of fine-tuning parameters?

ScalarLM intentionally does not maintain a single config file for all training parameters. Instead, it gives you direct access to the source code in ml/, which offers maximum flexibility. The parameters passable via train_args without code changes are listed in the Training Parameters section above. For anything else, modify ml/cray_megatron/megatron/training_loop.py.

Can I use a custom model architecture?

Yes. Add your model code under ml/, update load_tokenformer_model.py to load your architecture, and submit a training job with your local ml/ directory present. It will be uploaded and used server-side automatically.

Where can I find ScalarLM's source code?

The full repository is at github.com/tensorwavecloud/ScalarLM. It is CC-0 licensed.