Save Models to Hugging Face


By default, ScalarLM saves fine-tuned model checkpoints locally inside the job directory on the server (e.g. checkpoint_16.pt). This guide shows you how to automatically push those checkpoints to the Hugging Face Hub at the end of a training run, so your model is versioned, shareable, and ready for use anywhere.

Prerequisites

  • A Hugging Face account with a write-access token. Generate one at Settings → Access Tokens.
  • The huggingface_hub package, which is already available in the ScalarLM Docker image.
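Before submitting a job, it can save a failed run to sanity-check the token locally. The sketch below only checks the token's shape (user access tokens start with `hf_`); `looks_like_hf_token` is a hypothetical helper, not part of ScalarLM or huggingface_hub. For a real check against the Hub you could call `HfApi().whoami(token=...)`, which requires network access:

```python
import os

def looks_like_hf_token(token: str) -> bool:
    """Cheap local format check: Hugging Face user access tokens start with 'hf_'."""
    return isinstance(token, str) and token.startswith("hf_") and len(token) > 10

token = os.environ.get("HF_TOKEN", "")
print("Token format OK" if looks_like_hf_token(token) else "Set HF_TOKEN first")
```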

How It Works

ScalarLM's training pipeline runs from the ml/ directory. The relevant file for adding a post-training upload step is:

ml/cray_megatron/megatron/training_loop.py

Because ScalarLM uploads your local ./ml directory to the server with each job submission, you can make this change locally and it will be picked up automatically — no Docker rebuild required.


Step 1 — Check Out the ml/ Directory

Place a copy of the ml/ directory alongside your training script:

./train.py
./ml/   # your local customizations, uploaded automatically with each job

If you haven't already, clone the repo and copy the directory:

git clone https://github.com/tensorwavecloud/ScalarLM.git
cp -r ScalarLM/ml ./ml

Step 2 — Add the Upload Call to training_loop.py

Open ml/cray_megatron/megatron/training_loop.py. At the bottom of the training function, after the final checkpoint is saved, add the following block:

from huggingface_hub import HfApi
import os

def push_checkpoint_to_hub(
    checkpoint_path: str,
    repo_id: str,
    hf_token: str,
    commit_message: str = "Upload fine-tuned ScalarLM checkpoint",
):
    """Push a local checkpoint file to a Hugging Face repository."""
    api = HfApi()

    # Create the repo if it doesn't already exist
    api.create_repo(
        repo_id=repo_id,
        token=hf_token,
        exist_ok=True,
        private=True,       # set to False to make the model public
    )

    api.upload_file(
        path_or_fileobj=checkpoint_path,
        path_in_repo=os.path.basename(checkpoint_path),
        repo_id=repo_id,
        token=hf_token,
        commit_message=commit_message,
    )

    print(f"Checkpoint pushed to https://huggingface.co/{repo_id}")

Then, at the point in the training loop where the final checkpoint is written (search for torch.save), call the function immediately after:

# Existing checkpoint save — already in training_loop.py
torch.save(model.state_dict(), checkpoint_path)

# --- Add this block ---
hf_token = os.environ.get("HF_TOKEN")
hf_repo  = os.environ.get("HF_REPO_ID")   # e.g. "your-username/my-fine-tuned-gemma"

if hf_token and hf_repo:
    push_checkpoint_to_hub(
        checkpoint_path=checkpoint_path,
        repo_id=hf_repo,
        hf_token=hf_token,
    )
# ----------------------

The upload is gated on the presence of both environment variables, so it is a no-op if they are not set — existing jobs that don't provide them will be unaffected.
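If you prefer, that gate can be factored into a small helper, which also makes the no-op behavior easy to test in isolation. This is a sketch; `should_upload` is a hypothetical name, not part of ScalarLM:

```python
import os

def should_upload(env=os.environ) -> bool:
    """Upload only when both HF credentials are present and non-empty."""
    return bool(env.get("HF_TOKEN")) and bool(env.get("HF_REPO_ID"))

# The block in training_loop.py then becomes:
# if should_upload():
#     push_checkpoint_to_hub(checkpoint_path, os.environ["HF_REPO_ID"], os.environ["HF_TOKEN"])
```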


Step 3 — Pass the HF Credentials via train_args

Pass your Hugging Face token and target repo as environment variables through train_args in your training script:

import scalarlm

scalarlm.api_url = "https://gemma3_4b_it.farbodopensource.org"

llm = scalarlm.SupermassiveIntelligence()

dataset = [...]  # your training data

status = llm.train(
    dataset,
    train_args={
        "max_steps": 200,
        "learning_rate": 3e-3,
        "gpus": 2,
        "env": {
            "HF_TOKEN":    "hf_your_write_token_here",
            "HF_REPO_ID":  "your-username/my-fine-tuned-gemma",
        },
    },
)

print(status)

Step 4 — Monitor and Verify

Once the job completes, check the training logs to confirm the upload succeeded:

kubectl -n gemma3-4b-it logs -f <pod-name>
# Look for: "Checkpoint pushed to https://huggingface.co/your-username/my-fine-tuned-gemma"

Then visit your repository on Hugging Face to confirm the checkpoint file is present.
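You can also verify programmatically. `HfApi.list_repo_files` is a real huggingface_hub call; the repo id below is the hypothetical one used throughout this guide, and the network call is shown commented out so the helper itself can run offline:

```python
def checkpoint_in_listing(repo_files, checkpoint_name):
    """True if the uploaded checkpoint filename appears in the repo's file listing."""
    return checkpoint_name in set(repo_files)

# With network access and a valid token, fetch the real listing:
#   import os
#   from huggingface_hub import HfApi
#   files = HfApi().list_repo_files("your-username/my-fine-tuned-gemma",
#                                   token=os.environ["HF_TOKEN"])
#   print(checkpoint_in_listing(files, "checkpoint_16.pt"))
```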


Uploading a Full Model (Weights + Tokenizer)

The snippet above uploads the raw .pt checkpoint file. If you want to push a fully Hugging Face-compatible model (so it can be loaded with AutoModelForCausalLM.from_pretrained), load the checkpoint back into the model and use push_to_hub instead:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = config.model_name   # e.g. "google/gemma-3-4b-it"
model = AutoModelForCausalLM.from_pretrained(base_model_name)
# map_location="cpu" lets a GPU-saved checkpoint load on any machine
model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))

model.push_to_hub(hf_repo, token=hf_token)
AutoTokenizer.from_pretrained(base_model_name).push_to_hub(hf_repo, token=hf_token)

This produces a repository that can be loaded directly with the Hugging Face transformers library or deployed via another ScalarLM instance.


Security Note

Never hard-code your HF_TOKEN in a file committed to version control. Use an environment variable, a secrets manager, or pass it at runtime as shown above. Hugging Face tokens can be scoped to write-only access for a specific repo to limit exposure.
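Concretely, the train_args from Step 3 can read the token from your shell environment at submission time instead of embedding it in the script. A sketch, reusing the hypothetical repo name from earlier:

```python
import os

# export HF_TOKEN=hf_... in your shell before running the training script;
# the secret then never appears in the source file or in version control.
train_args = {
    "max_steps": 200,
    "learning_rate": 3e-3,
    "gpus": 2,
    "env": {
        "HF_TOKEN": os.environ.get("HF_TOKEN", ""),
        "HF_REPO_ID": "your-username/my-fine-tuned-gemma",
    },
}
```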