Inference
OpenAI Compatible Server
We host a test deployment of ScalarLM on TensorWave, available at the https://llama8btensorwave.cray-lm.com endpoint.
For example, to submit a request to it:
```shell
curl https://llama8btensorwave.cray-lm.com/v1/openai/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'
```
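The same request can be issued from Python using only the standard library. This is an illustrative sketch: `build_payload` and `send_chat` are hypothetical helper names, not part of any ScalarLM API, and actually sending the request requires network access to the endpoint.

```python
import json
import urllib.request

API_URL = "https://llama8btensorwave.cray-lm.com/v1/openai/chat/completions"

def build_payload(user_message):
    """Build a chat-completions request body matching the curl example above."""
    return {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
    }

def send_chat(payload, url=API_URL):
    """POST the payload as JSON and return the decoded response (needs network)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

payload = build_payload("Who won the world series in 2020?")
print(json.dumps(payload, indent=2))
```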
Using the Python client
You can also use the Python client to interact with the ScalarLM server.
```python
import scalarlm

scalarlm.api_url = "https://llama8btensorwave.cray-lm.com"

def get_dataset():
    dataset = []
    count = 4
    for i in range(count):
        dataset.append(f"What is {i} + {i}?")
    return dataset

llm = scalarlm.SupermassiveIntelligence()

dataset = get_dataset()

results = llm.generate(prompts=dataset)

print(results)
```
Batching Support
ScalarLM supports batching through the Python client. Notice in the example above that a list of prompts is provided to the llm.generate call.
ScalarLM will automatically distribute mini-batches of requests to inference GPUs and keep them fully utilized.
The requests are queued by ScalarLM, so you can submit very large numbers of queries. The parallelism and back pressure will automatically be handled by the ScalarLM client and server queues.
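The mini-batching idea can be sketched in a few lines. This is an illustrative client-side chunking example, not ScalarLM's internal scheduler, and the batch size here is arbitrary:

```python
def make_mini_batches(prompts, batch_size):
    """Split a prompt list into fixed-size mini-batches (the last may be short)."""
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

prompts = [f"What is {i} + {i}?" for i in range(10)]
batches = make_mini_batches(prompts, batch_size=4)
print([len(b) for b in batches])  # → [4, 4, 2]
```

In ScalarLM this distribution happens automatically on the server side, so the client simply passes the full prompt list to llm.generate.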
Image Batching Support
You can also use the Python client to submit multi-modal (text + image) requests to ScalarLM. Note that the helper functions and image downloads must run before the dataset is built:
```python
import base64
from io import BytesIO

import requests
import scalarlm
from PIL import Image

scalarlm.api_url = "https://llama70b.cray-lm.com"

def download_image(url):
    """Download an image from URL and return as PIL Image"""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an exception for bad status codes
        # Convert bytes to PIL Image
        return Image.open(BytesIO(response.content))
    except Exception as e:
        print(f"Error downloading image from {url}: {e}")
        return None

def pil_to_base64(pil_image, format="PNG"):
    """Encode a PIL image as a base64 string."""
    # Save the image to an in-memory bytes buffer in the specified format
    buffer = BytesIO()
    pil_image.save(buffer, format=format)
    # Encode the buffer contents as base64
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

def get_image(index):
    # Single-image input inference
    if index < len(images) and images[index] is not None:
        return pil_to_base64(images[index], format="PNG")
    print(f"Invalid index {index} or image is None")
    return None

# Example URLs - replace these with your actual image URLs
image_urls = [
    "https://picsum.photos/300/200?random=1",
    "https://picsum.photos/300/200?random=2",
    "https://picsum.photos/300/200?random=3",
    "https://picsum.photos/300/200?random=4",
]

# Download all images
images = []
print("Downloading images...")
for i, url in enumerate(image_urls):
    print(f"Downloading image {i+1} from {url}")
    img = download_image(url)
    images.append(img)
    if img:
        print(f"✓ Image {i+1} downloaded successfully - Size: {img.size}")
    else:
        print(f"✗ Failed to download image {i+1}")

def get_dataset():
    dataset = []
    count = 4
    for i in range(count):
        dataset.append({"text": "What is in this image?", "image": get_image(i)})
    return dataset

llm = scalarlm.SupermassiveIntelligence()

dataset = get_dataset()

results = llm.generate(prompts=dataset)

print(results)
```
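The base64 encoding used by pil_to_base64 is lossless, so the server can reconstruct the exact image bytes. A quick stdlib-only roundtrip (with placeholder bytes standing in for real PNG data) illustrates this:

```python
import base64

# Placeholder bytes standing in for an encoded PNG image
raw = b"\x89PNG fake image bytes"

# Encode to a UTF-8 base64 string, as pil_to_base64 does with the buffer contents
encoded = base64.b64encode(raw).decode("utf-8")

# Decoding recovers the original bytes exactly
decoded = base64.b64decode(encoded)
assert decoded == raw
print(encoded)
```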