Advanced Routing with Rhai: Custom AI Scheduling Logic
Stop using static routing. Master rbee's Rhai scripting engine to implement intelligent routing for A/B testing, canary deployments, and GPU farm monetization. Premium feature included in Queen + Worker bundle (€279).
Advanced Rhai routing is included in Premium Queen (€129 lifetime) and Queen + Worker Bundle (€279 lifetime). Free tier includes basic round-robin routing only. Pre-launch pricing available now through Q2 2026.
Why Custom Routing? The Problem with Static Load Balancing
Basic routing (round-robin, least-loaded) works for simple setups, but production AI infrastructure needs intelligence. Rhai scripting lets you implement sophisticated routing that Ollama and vLLM can't match. Here's why custom routing matters:
rbee uses Rhai, a Rust-embedded scripting language, for custom routing logic. Unlike Ollama (no routing) or vLLM (basic load balancing), Rhai gives you complete control over how requests are distributed.
What is Rhai? A Powerful Scripting Engine for AI Orchestration
Rhai is a simple, fast scripting language designed for embedding in Rust applications. It has a JavaScript-like syntax and is sandboxed for security. rbee uses Rhai to give you complete control over how AI requests are routed across your multi-machine GPU cluster without the complexity of Kubernetes or vLLM.
// Rhai syntax is similar to JavaScript
let x = 10;
let y = 20;
let sum = x + y;
// Functions
fn greet(name) {
    return "Hello, " + name;
}
// Arrays and loops
let workers = [1, 2, 3, 4, 5];
for worker in workers {
    print(worker);
}

Basic Routing Script: Getting Started with Rhai
Create a routing script at ~/.config/rbee/routing.rhai. This is the foundation for all custom routing logic:
// Basic routing function
fn route_request(request, workers) {
    // Filter available workers
    let available = workers.filter(|w| w.status == "ready");

    // Return the first available worker
    return available[0];
}

Request Object: Understanding AI Request Metadata
The request object contains information about the incoming AI inference request. Use this data to make intelligent routing decisions:
// Request object structure
{
    model: "llama-3.1-8b",          // Model name
    model_size: 8_000_000_000,      // Model size in parameters
    user_id: "user@example.com",    // User identifier (if provided)
    priority: "normal",             // "low", "normal", "high"
    max_tokens: 2048,               // Max output tokens
    stream: true,                   // Streaming response?
    metadata: {                     // Custom metadata
        "tier": "premium",
        "region": "eu-west"
    }
}

Worker Object: GPU Worker Capabilities and Status
The workers array contains information about available GPU workers (LLM, Stable Diffusion, etc.). Each worker has metadata about its capabilities, load, and performance:
// Worker object structure
{
    id: "worker-gaming-pc",
    status: "ready",                 // "ready", "busy", "offline"
    backend: "CUDA",                 // "CUDA", "Metal", "ROCm", "CPU"
    vram: 24,                        // VRAM in GB
    vram_used: 8,                    // Currently used VRAM
    load: 0.3,                       // Current load (0.0 - 1.0)
    models: ["llama-3.1-8b", ...],   // Available models
    performance: 120,                // Tokens/sec benchmark
    cost_per_hour: 0.50,             // Estimated cost
    metadata: {                      // Custom metadata
        "region": "us-east",
        "tier": "premium"
    }
}

Example 1: A/B Testing AI Models
Route 10% of traffic to a new model version for safe experimentation:
fn route_request(request, workers) {
    // Generate random number 0-99
    let random = rand() * 100;

    // 10% of traffic goes to new model
    let target_model = if random < 10 {
        "llama-3.2-8b"  // New version
    } else {
        "llama-3.1-8b"  // Stable version
    };

    // Find workers with the target model
    let available = workers.filter(|w|
        w.status == "ready" && w.models.contains(target_model)
    );

    // Return least loaded worker
    return available.sort_by(|a, b| a.load < b.load)[0];
}

Example 2: Canary Deployment for AI Models
Gradually increase traffic to a new model version with zero downtime:
// Percentage of traffic sent to canary workers
const CANARY_PERCENT = 25; // Start with 25%
fn route_request(request, workers) {
    let random = rand() * 100;

    // Canary traffic
    if random < CANARY_PERCENT {
        let canary_workers = workers.filter(|w|
            w.metadata.get("canary") == true && w.status == "ready"
        );
        if !canary_workers.is_empty() {
            return canary_workers.sort_by(|a, b| a.load < b.load)[0];
        }
    }

    // Stable traffic
    let stable_workers = workers.filter(|w|
        w.metadata.get("canary") != true && w.status == "ready"
    );
    return stable_workers.sort_by(|a, b| a.load < b.load)[0];
}

Example 3: User-Based Routing for GPU Farm Monetization
Route premium customers to faster GPUs, free users to CPU. This is how GPU operators generate revenue:
fn route_request(request, workers) {
    let user_tier = request.metadata.get("tier");

    // Premium users get high-performance workers
    if user_tier == "premium" {
        let premium_workers = workers.filter(|w|
            w.backend == "CUDA" && w.performance > 100 && w.status == "ready"
        );
        if !premium_workers.is_empty() {
            return premium_workers.sort_by(|a, b| a.load < b.load)[0];
        }
    }

    // Free tier users get CPU or slower GPUs
    if user_tier == "free" {
        let free_workers = workers.filter(|w|
            (w.backend == "CPU" || w.performance < 50) && w.status == "ready"
        );
        if !free_workers.is_empty() {
            return free_workers.sort_by(|a, b| a.load < b.load)[0];
        }
    }

    // Default: any available worker
    return workers.filter(|w| w.status == "ready")[0];
}

Example 4: Cost Optimization with Intelligent Routing
Use cheaper CPU workers for simple queries, reserve expensive GPUs for complex tasks:
fn route_request(request, workers) {
    let max_tokens = request.max_tokens;

    // Simple queries (< 100 tokens): use CPU
    if max_tokens < 100 {
        let cpu_workers = workers.filter(|w|
            w.backend == "CPU" && w.status == "ready"
        );
        if !cpu_workers.is_empty() {
            return cpu_workers[0];
        }
    }

    // Medium queries (100-500 tokens): use any GPU
    if max_tokens < 500 {
        let gpu_workers = workers.filter(|w|
            w.backend != "CPU" && w.status == "ready"
        );
        if !gpu_workers.is_empty() {
            // Sort by cost per hour (cheapest first)
            return gpu_workers.sort_by(|a, b|
                a.cost_per_hour < b.cost_per_hour
            )[0];
        }
    }

    // Large queries: use high-performance GPU
    let fast_workers = workers.filter(|w|
        w.performance > 100 && w.status == "ready"
    );
    return fast_workers.sort_by(|a, b| a.load < b.load)[0];
}

Example 5: Time-Based Routing for Energy Efficiency
Use cheaper CPU workers during off-peak hours, activate GPUs during peak demand:
fn route_request(request, workers) {
    // Get current hour (0-23)
    let hour = timestamp().hour();

    // Off-peak hours (midnight to 6am): use CPU to save power
    if hour >= 0 && hour < 6 {
        let cpu_workers = workers.filter(|w|
            w.backend == "CPU" && w.status == "ready"
        );
        if !cpu_workers.is_empty() {
            return cpu_workers[0];
        }
    }

    // Peak hours (9am to 5pm): use all available GPUs
    if hour >= 9 && hour < 17 {
        let gpu_workers = workers.filter(|w|
            w.backend != "CPU" && w.status == "ready"
        );
        if !gpu_workers.is_empty() {
            return gpu_workers.sort_by(|a, b| a.load < b.load)[0];
        }
    }

    // Default: first available worker
    return workers.filter(|w| w.status == "ready")[0];
}

Example 6: Region-Based Routing for Multi-Location Deployments
Route to workers in the same region for lower latency and better user experience:
fn route_request(request, workers) {
    let user_region = request.metadata.get("region");

    // Try to find a worker in the same region
    if user_region != () {
        let local_workers = workers.filter(|w|
            w.metadata.get("region") == user_region && w.status == "ready"
        );
        if !local_workers.is_empty() {
            return local_workers.sort_by(|a, b| a.load < b.load)[0];
        }
    }

    // Fallback to any available worker
    return workers.filter(|w| w.status == "ready")[0];
}

Example 7: Marketplace Worker Routing for Multi-Modal AI
Route requests to specific marketplace workers based on task type (LLM inference, image generation, etc.):
fn route_request(request, workers) {
    let request_type = request.metadata.get("type");

    // Route image generation to SD workers
    if request_type == "image" {
        let sd_workers = workers.filter(|w|
            w.id.starts_with("sd-worker-") && w.status == "ready"
        );
        if !sd_workers.is_empty() {
            return sd_workers.sort_by(|a, b| a.load < b.load)[0];
        }
    }

    // Route LLM inference to LLM workers
    if request_type == "llm" {
        let llm_workers = workers.filter(|w|
            w.id.starts_with("llm-worker-") && w.status == "ready"
        );
        if !llm_workers.is_empty() {
            return llm_workers.sort_by(|a, b| a.load < b.load)[0];
        }
    }

    // Default: any available worker
    return workers.filter(|w| w.status == "ready")[0];
}

Testing Your Routing Script: Simulation and Benchmarking
Test your routing logic before deploying to production. rbee provides built-in simulation tools:
# Test routing script
rbee routing test --script ~/.config/rbee/routing.rhai
# Simulate requests
rbee routing simulate \
    --script ~/.config/rbee/routing.rhai \
    --requests 1000 \
    --report
# Output:
# Worker Distribution:
#   worker-gaming-pc: 450 requests (45%)
#   worker-mac-studio: 350 requests (35%)
#   worker-cpu: 200 requests (20%)
# Average latency: 120ms

Built-in Helper Functions for Rhai Routing Scripts
rbee provides powerful helper functions to simplify complex routing logic:
// Random number (0.0 - 1.0)
let r = rand();
// Current timestamp
let now = timestamp();
let hour = now.hour();
let day = now.day();
// Hash a string (for consistent routing)
let hash = hash_str("user@example.com");
let worker_index = hash % workers.len();
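Hash-based routing is how you keep a given user on the same worker across requests (for example, to reuse a warm cache). Here is a minimal sketch built only from the helpers and fields shown above (hash_str(), the workers array, request.user_id, and the "ready" filter); the fallback behaviour is an illustrative choice, not a fixed rbee convention:

// Sticky routing: the same user consistently lands on the same worker
fn route_request(request, workers) {
    let available = workers.filter(|w| w.status == "ready");

    // Fallback: if nothing is ready, queue on the first worker
    if available.is_empty() {
        return workers[0];
    }

    // Map the user to a stable index in the available worker list
    let hash = hash_str(request.user_id);
    return available[hash % available.len()];
}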
// Filter and sort
let available = workers.filter(|w| w.status == "ready");
let sorted = available.sort_by(|a, b| a.load < b.load);
// Array operations
let first = workers[0];
let last = workers[-1];
let count = workers.len();
let contains = workers.contains(some_worker);

Error Handling: Graceful Fallbacks in Routing Scripts
Always handle edge cases to ensure requests never fail. Implement graceful fallbacks:
fn route_request(request, workers) {
    // Check if any workers are available
    if workers.is_empty() {
        throw "No workers available";
    }

    let available = workers.filter(|w| w.status == "ready");

    // Fallback if no workers are ready
    if available.is_empty() {
        // Return first worker (will queue the request)
        return workers[0];
    }

    // Your routing logic here
    return available.sort_by(|a, b| a.load < b.load)[0];
}

Performance Tips: Optimizing Rhai Routing for Low Latency
- Keep scripts simple: Complex logic adds latency to every request
- Use early returns: Exit as soon as you find a suitable worker (see the sketch after this list)
- Prefer filter/sort: Avoid loops when possible
- Test with rbee routing simulate: Benchmark before deploying to production
- Monitor routing metrics: Track worker distribution and latency impact
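As a concrete illustration of the early-return tip, here is a minimal sketch that skips the full least-loaded sort for short requests; the 100-token cutoff mirrors Example 4 and is purely illustrative:

fn route_request(request, workers) {
    let ready = workers.filter(|w| w.status == "ready");

    // Early return: short requests don't need full scoring
    if request.max_tokens < 100 && !ready.is_empty() {
        return ready[0];
    }

    // Everything else gets the least-loaded sort
    return ready.sort_by(|a, b| a.load < b.load)[0];
}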
Debugging Rhai Routing Scripts: Logging and Monitoring
Enable debug logging to see routing decisions and troubleshoot issues:
# In ~/.config/rbee/config.toml
[routing]
script = "~/.config/rbee/routing.rhai"
debug = true  # Log routing decisions
# View logs
tail -f ~/.config/rbee/logs/routing.log

Best Practices for Production Rhai Routing
- Start simple: Begin with basic routing, add complexity as needed
- Test thoroughly: Use rbee routing simulate before deploying
- Monitor metrics: Track worker utilization and request latency
- Have fallbacks: Always return a worker, even if conditions aren't met
- Document your logic: Add comments explaining routing decisions (a combined sketch follows this list)
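Put together, a brief sketch of what these practices can look like in a single documented script. It only recombines the tier routing from Example 3 with the fallback pattern from the error-handling section, so treat it as a starting template rather than a prescribed implementation:

// routing.rhai - documented production routing
// Goal: premium users on fast CUDA workers, everyone else least-loaded,
// with a guaranteed fallback so no request is ever dropped.
fn route_request(request, workers) {
    let ready = workers.filter(|w| w.status == "ready");

    // Fallback first: if nothing is ready, queue on the first worker
    if ready.is_empty() {
        return workers[0];
    }

    // Premium tier gets the fast CUDA pool (see Example 3)
    if request.metadata.get("tier") == "premium" {
        let fast = ready.filter(|w| w.backend == "CUDA" && w.performance > 100);
        if !fast.is_empty() {
            return fast.sort_by(|a, b| a.load < b.load)[0];
        }
    }

    // Everyone else: least-loaded ready worker
    return ready.sort_by(|a, b| a.load < b.load)[0];
}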
Use Rhai routing to monetize your GPU farm. Route premium customers to your fastest GPUs, free users to CPU, and sell excess capacity on the marketplace. This is how GPU operators generate recurring revenue with rbee.
Why Rhai Beats the Competition
- Ollama: No routing. Single machine only. Can't implement custom logic.
- vLLM: Basic load balancing only. Requires Kubernetes. No custom routing.
- rbee (Premium): Full Rhai scripting. Multi-machine. Unlimited custom logic. No Kubernetes.
- Cloud APIs: Fixed routing. Vendor lock-in. $1,500-3,000/month for the same workload.
Ready to master production AI routing?
Advanced Rhai routing is included in Premium Queen (€129 lifetime) and Queen + Worker Bundle (€279 lifetime). Pre-launch pricing available now. After Q2 2026, pricing moves to monthly subscription.