RUNBOOKS¶
Incident response checklists for RUNE.
Common Incident: Vast.ai Instance Stuck in Creating¶
Symptom¶
- CLI or API job hangs in
creatingstate for more than 5 minutes. - Vast.ai dashboard shows instance as
startingorerrored.
Resolution¶
- Check Vast.ai API Key: Ensure
VAST_API_KEYis correct and has funds. - Manual Termination: If the instance is in a bad state, use the Vast.ai CLI or dashboard to terminate it to avoid costs.
- Adjust Constraints: If no offers match, try relaxing
--vastai-min-dphor--vastai-reliability.
Common Incident: Ollama Model Pull Failure¶
Symptom¶
- Workflow fails during
pull_modelphase with connection error.
Resolution¶
- Check Ollama Server Connectivity: Verify
RUNE_OLLAMA_URLis reachable from the RUNE runner. - Disk Space: Ensure the Ollama host has enough disk space for the requested model.
- Model Name: Double-check the model name (e.g.,
llama3.1:8bvs.llama3.1).
Common Incident: Job Store Lock (SQLite)¶
Symptom¶
sqlite3.OperationalError: database is lockedin logs.
Resolution¶
- Concurrency Check: Ensure multiple writers aren't trying to access the same
jobs.dbsimultaneously without proper locking. - K8s Volume Mount: If in K8s, verify the PersistentVolume claim is
ReadWriteOnceand not mounted by multiple pods. - Restart: Restarting the
rune-apipod may resolve transient locks.
Incident Response¶
- Report system-wide outages or security issues to [luca@bucaniere.us].
- Check
S3sink for evidence of job failure results.