Save Models to Hugging Face
By default, ScalarLM saves fine-tuned model checkpoints locally inside the job directory on the server (e.g. checkpoint_16.pt). This guide shows you how to automatically push those checkpoints to the Hugging Face Hub at the end of a training run, so your model is versioned, shareable, and ready for use anywhere.
Prerequisites
- A Hugging Face account with a write-access token. Generate one at Settings → Access Tokens.
- The huggingface_hub package, which is already available in the ScalarLM Docker image.
How It Works
ScalarLM's training pipeline runs from the ml/ directory. The relevant file for adding a post-training upload step is:
ml/cray_megatron/megatron/training_loop.py
Because ScalarLM uploads your local ./ml directory to the server with each job submission, you can make this change locally and it will be picked up automatically — no Docker rebuild required.
Step 1 — Check Out the ml/ Directory
Place a copy of the ml/ directory alongside your training script:
./train.py
./ml/ # your local customizations, uploaded automatically with each job
If you haven't already, clone the repo and copy the directory:
git clone https://github.com/tensorwavecloud/ScalarLM.git
cp -r ScalarLM/ml ./ml
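As a quick sanity check before submitting a job, you can confirm that the file this guide edits is actually present in your local copy. The helper below is hypothetical (not part of ScalarLM), shown only for illustration:

```python
import os

def ml_dir_ready(root: str = ".") -> bool:
    """Return True when the local ml/ copy contains the training loop
    file this guide edits (hypothetical check, not part of ScalarLM)."""
    target = os.path.join(root, "ml", "cray_megatron", "megatron", "training_loop.py")
    return os.path.isfile(target)
```

Run it from the directory that contains both train.py and ml/; if it returns False, your customizations will not be uploaded with the job.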
Step 2 — Add the Upload Call to training_loop.py
Open ml/cray_megatron/megatron/training_loop.py. At the bottom of the training function, after the final checkpoint is saved, add the following block:
from huggingface_hub import HfApi
import os

def push_checkpoint_to_hub(
    checkpoint_path: str,
    repo_id: str,
    hf_token: str,
    commit_message: str = "Upload fine-tuned ScalarLM checkpoint",
):
    """Push a local checkpoint file to a Hugging Face repository."""
    api = HfApi()

    # Create the repo if it doesn't already exist
    api.create_repo(
        repo_id=repo_id,
        token=hf_token,
        exist_ok=True,
        private=True,  # set to False to make the model public
    )

    api.upload_file(
        path_or_fileobj=checkpoint_path,
        path_in_repo=os.path.basename(checkpoint_path),
        repo_id=repo_id,
        token=hf_token,
        commit_message=commit_message,
    )
    print(f"Checkpoint pushed to https://huggingface.co/{repo_id}")
Then, at the point in the training loop where the final checkpoint is written (search for torch.save), call the function immediately after:
# Existing checkpoint save — already in training_loop.py
torch.save(model.state_dict(), checkpoint_path)

# --- Add this block ---
hf_token = os.environ.get("HF_TOKEN")
hf_repo = os.environ.get("HF_REPO_ID")  # e.g. "your-username/my-fine-tuned-gemma"

if hf_token and hf_repo:
    push_checkpoint_to_hub(
        checkpoint_path=checkpoint_path,
        repo_id=hf_repo,
        hf_token=hf_token,
    )
# ----------------------
The upload is gated on the presence of both environment variables, so it is a no-op if they are not set — existing jobs that don't provide them will be unaffected.
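That gating behavior can also be expressed as a small pure function, which makes it easy to reason about (a sketch for illustration only; the inline version above is what actually goes into training_loop.py):

```python
import os

def hub_upload_plan(env: dict, checkpoint_path: str):
    """Mirror the env-var gate: return (repo_id, path_in_repo) when both
    HF_TOKEN and HF_REPO_ID are set, or None for the no-op case."""
    token = env.get("HF_TOKEN")
    repo = env.get("HF_REPO_ID")
    if not (token and repo):
        return None  # job behaves exactly as before
    # upload_file places the checkpoint at the repo root under its base name
    return repo, os.path.basename(checkpoint_path)
```

With neither or only one variable set, the plan is None and nothing is uploaded; with both set, the checkpoint lands at the repo root under its original file name.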
Step 3 — Pass the HF Credentials via train_args
Pass your Hugging Face token and target repo as environment variables through train_args in your training script:
import scalarlm

scalarlm.api_url = "https://gemma3_4b_it.farbodopensource.org"

llm = scalarlm.SupermassiveIntelligence()

dataset = [...]  # your training data

status = llm.train(
    dataset,
    train_args={
        "max_steps": 200,
        "learning_rate": 3e-3,
        "gpus": 2,
        "env": {
            "HF_TOKEN": "hf_your_write_token_here",
            "HF_REPO_ID": "your-username/my-fine-tuned-gemma",
        },
    },
)

print(status)
Step 4 — Monitor and Verify
Once the job completes, check the training logs to confirm the upload succeeded:
kubectl -n gemma3-4b-it logs -f <pod-name>
# Look for: "Checkpoint pushed to https://huggingface.co/your-username/my-fine-tuned-gemma"
Then visit your repository on Hugging Face to confirm the checkpoint file is present.
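If you capture the job logs programmatically rather than reading them by eye, a small helper (hypothetical, shown for illustration) can pull out the success line printed by push_checkpoint_to_hub:

```python
def pushed_repo_url(log_text: str):
    """Return the Hugging Face repo URL from the upload success line,
    or None if the upload never ran (e.g. the env vars were not set)."""
    prefix = "Checkpoint pushed to "
    for line in log_text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None
```

A None result after a completed job is the signal that HF_TOKEN or HF_REPO_ID never reached the training process.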
Uploading a Full Model (Weights + Tokenizer)
The snippet above uploads the raw .pt checkpoint file. If you want to push a fully HuggingFace-compatible model (so it can be loaded with AutoModelForCausalLM.from_pretrained), load the checkpoint back into the model and use push_to_hub instead:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model_name = config.model_name  # e.g. "google/gemma-3-4b-it"

model = AutoModelForCausalLM.from_pretrained(base_model_name)
model.load_state_dict(torch.load(checkpoint_path))

model.push_to_hub(hf_repo, token=hf_token)
AutoTokenizer.from_pretrained(base_model_name).push_to_hub(hf_repo, token=hf_token)
This produces a repository that can be loaded directly with the HuggingFace transformers library or deployed via another ScalarLM instance.
Security Note
Never hard-code your HF_TOKEN in a file committed to version control. Use an environment variable, a secrets manager, or pass it at runtime as shown above. Hugging Face tokens can be scoped to write-only access for a specific repo to limit exposure.
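For example, your training script can pull the token from your local shell environment at submission time instead of embedding it in the file. The helper name build_hf_env is hypothetical, shown only as one way to structure this:

```python
import os

def build_hf_env(repo_id: str) -> dict:
    """Build the env mapping for train_args without hard-coding the token.
    Reads HF_TOKEN from the local shell environment and fails fast if unset."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("Set HF_TOKEN in your shell before submitting the job")
    return {"HF_TOKEN": token, "HF_REPO_ID": repo_id}
```

You would then pass build_hf_env("your-username/my-fine-tuned-gemma") as the "env" value in train_args, keeping the token out of version control entirely.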
Related
- Custom Training — how the ml/ directory upload works
- Training Logs — monitoring job output
- Frequently Asked Questions — optimizer and loss function customization