RUNE Documentation

RUNBOOKS

Incident response checklists for RUNE.

Common Incident: Vast.ai Instance Stuck in Creating

Symptom

A CLI or API job hangs in the creating state for more than 5 minutes.
The Vast.ai dashboard shows the instance as starting or errored.

Resolution

Check Vast.ai API Key: Ensure VAST_API_KEY is correct and that the associated Vast.ai account has sufficient credit.
Manual Termination: If the instance is in a bad state, terminate it with the Vast.ai CLI or dashboard so it stops accruing charges.
Adjust Constraints: If no offers match, try relaxing --vastai-min-dph or --vastai-reliability.
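
The checklist above can be sketched as a small triage helper. This is a minimal sketch, not part of RUNE: the function name triage_creating_job, its parameters, and the wording of the returned actions are hypothetical, and the only values taken from this runbook are the 5-minute threshold, the creating state, VAST_API_KEY, and the two CLI flags.

```python
import os
from datetime import datetime, timedelta, timezone

# Threshold taken from the symptom description in this runbook.
CREATING_TIMEOUT = timedelta(minutes=5)


def triage_creating_job(state, state_since, now=None, env=os.environ):
    """Return suggested actions for a job that may be stuck in 'creating'.

    Hypothetical helper: `state` is the job state string, `state_since`
    is a timezone-aware datetime marking when the job entered that state.
    """
    now = now or datetime.now(timezone.utc)
    # Not stuck: either a different state or still within the time budget.
    if state != "creating" or now - state_since <= CREATING_TIMEOUT:
        return []

    actions = []
    if not env.get("VAST_API_KEY"):
        actions.append("set VAST_API_KEY and confirm the account has credit")
    actions.append("terminate the instance via the Vast.ai CLI or dashboard")
    actions.append("relax --vastai-min-dph or --vastai-reliability and retry")
    return actions
```

An empty list means the job is not yet considered stuck; a non-empty list mirrors the resolution steps in the order given above.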

Common Incident: Ollama Model Pull Failure

Symptom

The workflow fails during the pull_model phase with a connection error.

Resolution

Check Ollama Server Connectivity: Verify RUNE_OLLAMA_URL is reachable from the RUNE runner.
Disk Space: Ensure the Ollama host has enough disk space for the requested model.
Model Name: Double-check the model name (e.g., llama3.1:8b vs. llama3.1).
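
The connectivity and model-name checks above might be scripted as follows. This is a sketch under stated assumptions: it assumes RUNE_OLLAMA_URL holds the server's base URL, it probes Ollama's /api/tags endpoint (which lists locally available models), and the helper names normalize_model_name and ollama_reachable are hypothetical rather than part of RUNE. It should be run from the same host or pod as the RUNE runner so that it tests the same network path.

```python
import json
import os
import urllib.request


def normalize_model_name(name):
    """Make the implicit tag explicit: an untagged name means ':latest'."""
    # Only inspect the last path segment so a registry host:port is ignored.
    last_segment = name.rsplit("/", 1)[-1]
    return name if ":" in last_segment else f"{name}:latest"


def ollama_reachable(base_url, timeout_s=5.0):
    """Return True if the Ollama server answers GET /api/tags with JSON."""
    url = f"{base_url.rstrip('/')}/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            json.load(resp)
        return True
    except (OSError, ValueError):
        # URLError and timeouts are OSError subclasses; bad JSON is ValueError.
        return False


if __name__ == "__main__":
    base = os.environ.get("RUNE_OLLAMA_URL", "")
    print("reachable:", bool(base) and ollama_reachable(base))
    print("resolved model:", normalize_model_name("llama3.1"))
```

The normalization step shows why llama3.1 and llama3.1:8b are different requests: the untagged form resolves to llama3.1:latest, which may be a different model variant or may not exist.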

Common Incident: Job Store Lock (SQLite)

Symptom

The logs show sqlite3.OperationalError: database is locked.

Resolution

Concurrency Check: Ensure multiple writers aren't trying to access the same jobs.db simultaneously without proper locking.
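K8s Volume Mount: If running in K8s, verify the PersistentVolumeClaim uses the ReadWriteOnce access mode and that the volume is not mounted by multiple pods. ReadWriteOnce restricts the volume to a single node, not a single pod, so two pods on the same node can still share it.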
Restart: Restarting the rune-api pod may resolve transient locks.
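
Transient locks can often be absorbed on the client side instead of by a restart. The sketch below uses only Python's standard sqlite3 module: the timeout argument to sqlite3.connect makes a writer wait for a lock instead of failing immediately, and a small retry loop with exponential backoff covers the remaining cases. The helper names open_job_store and execute_with_retry are hypothetical, and nothing here is confirmed to match how RUNE opens jobs.db.

```python
import sqlite3
import time


def open_job_store(path, busy_timeout_s=30.0):
    """Open the database; wait up to busy_timeout_s for a competing lock."""
    return sqlite3.connect(path, timeout=busy_timeout_s)


def execute_with_retry(conn, sql, params=(), attempts=5, base_delay_s=0.1):
    """Run one statement in its own transaction, retrying on lock errors."""
    for attempt in range(attempts):
        try:
            # The context manager commits on success and rolls back on error.
            with conn:
                return conn.execute(sql, params)
        except sqlite3.OperationalError as exc:
            is_lock_error = "database is locked" in str(exc)
            if not is_lock_error or attempt == attempts - 1:
                raise
            # Exponential backoff before the next attempt.
            time.sleep(base_delay_s * 2 ** attempt)
```

A retry loop only helps with short-lived contention. If the error persists after the final attempt, it is re-raised, which points back to the concurrency and volume-mount checks above.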

Incident Response

Report system-wide outages or security issues to luca@bucaniere.us.
Check the S3 sink for evidence of job failures.