Heterogeneous Hardware: Run NVIDIA, Apple, and AMD GPUs Together
Stop being locked into one GPU vendor. rbee orchestrates NVIDIA CUDA, Apple Metal, and AMD ROCm hardware in a single system, so you get maximum utilization out of the machines you already own.
The Reality of Mixed Hardware: Stop Vendor Lock-in
Most developers don't have a uniform GPU fleet. You might have an NVIDIA RTX 4090 in your gaming PC, a Mac Studio with an M2 Ultra, an old server with AMD GPUs, and perhaps some cloud instances with Tesla cards. Each uses a different compute framework, and traditional orchestration platforms force you to pick one vendor. rbee lets you orchestrate them all in a single system, maximizing your existing hardware investment instead of forcing you to standardize.
How rbee Handles Heterogeneous Hardware: Three Key Features
1. Automatic Backend Detection
When a worker starts, rbee automatically detects available compute backends:
```
# Worker startup log
[INFO] Detecting compute backends...
[INFO] ✓ CUDA 12.1 detected (NVIDIA RTX 4090, 24GB VRAM)
[INFO] ✓ cuBLAS, cuDNN available
[INFO] Worker registered as: worker-gaming-pc (CUDA)
```

2. Model Format Compatibility
rbee uses GGUF format (from llama.cpp), which works across all backends:
GGUF Format Support
| Backend | Framework | GGUF Support |
|---|---|---|
| NVIDIA CUDA | llama.cpp + CUDA | ✓ |
| Apple Metal | llama.cpp + Metal | ✓ |
| AMD ROCm | llama.cpp + ROCm | ✓ |
| CPU | llama.cpp (AVX2/AVX512) | ✓ |
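Because every backend consumes the same GGUF artifact, one download works everywhere. A minimal sketch using the `rbee model download` command from the walkthrough below (the hive names come from that example):

```bash
# The same GGUF file serves CUDA, Metal, and ROCm workers alike;
# each worker loads it with its locally detected llama.cpp backend.
rbee model download llama-3.1-8b-q4 --hive gaming-pc   # NVIDIA CUDA
rbee model download llama-3.1-8b-q4 --hive mac-studio  # Apple Metal
rbee model download llama-3.1-8b-q4 --hive amd-server  # AMD ROCm
```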
3. Smart Routing Based on Hardware Capabilities
The keeper routes requests based on worker capabilities:
```toml
# Example: Route 70B model to high-VRAM workers
[routing]
strategy = "capability-aware"

[[routing.rules]]
model_size = ">40B"
prefer_backends = ["CUDA"]  # Prefer NVIDIA for large models
min_vram = 40               # Require 40GB+ VRAM

[[routing.rules]]
model_size = "<10B"
prefer_backends = ["Metal", "CUDA", "ROCm"]  # Any GPU works
min_vram = 8
```

Real-World Example: Mixed Hardware Setup
Let's set up a system with different GPU types:
Machine 1: Gaming PC (NVIDIA RTX 4090)
```bash
# Add gaming PC with NVIDIA GPU
rbee hive add gaming-pc 192.168.1.100

# Deploy worker (auto-detects CUDA)
rbee worker deploy --hive gaming-pc

# Download large models (per the quantization table below, 70B at Q4
# needs ~35GB, so a 24GB card relies on llama.cpp's partial CPU offload)
rbee model download llama-3.1-70b-q4 --hive gaming-pc
```

Machine 2: Mac Studio (Apple M2 Ultra)
```bash
# Add Mac Studio
rbee hive add mac-studio 192.168.1.101

# Deploy worker (auto-detects Metal)
rbee worker deploy --hive mac-studio

# Download medium models (M2 Ultra has 128GB unified memory)
rbee model download llama-3.1-8b --hive mac-studio
rbee model download mistral-7b --hive mac-studio
```

Machine 3: Old Server (AMD Radeon)
```bash
# Add AMD server
rbee hive add amd-server 192.168.1.102

# Deploy worker (auto-detects ROCm)
rbee worker deploy --hive amd-server

# Download smaller models
rbee model download llama-3.1-8b --hive amd-server
```

Machine 4: CPU-Only Fallback
```bash
# Add CPU-only machine
rbee hive add cpu-server 192.168.1.103

# Deploy worker (uses CPU backend)
rbee worker deploy --hive cpu-server

# Download quantized models for CPU inference
rbee model download llama-3.1-8b-q4 --hive cpu-server
```

Performance Considerations: Optimize for Each Backend
VRAM vs Unified Memory
Different architectures have different memory characteristics:
- NVIDIA CUDA: dedicated VRAM (fast, but limited in size)
- Apple Metal: unified memory (large capacity, shared with the CPU)
- AMD ROCm: dedicated VRAM (varies by card)
- CPU: system RAM (large, but slower)
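This matters for placement: a model that overflows any single card's dedicated VRAM can still fit in unified memory. A hedged sketch (the q5 model identifier follows this guide's naming pattern but is an assumption, not a confirmed catalog entry):

```bash
# 70B at Q5 needs ~44GB (see the table below): too big for a 24GB RTX 4090,
# but comfortable in the M2 Ultra's 128GB of unified memory.
rbee model download llama-3.1-70b-q5 --hive mac-studio
```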
Quantization Strategies
Choose quantization based on available VRAM:
VRAM Requirements by Quantization
| Model Size | Q8 (8-bit) | Q5 (5-bit) | Q4 (4-bit, recommended) |
|---|---|---|---|
| 7B params | ~8GB VRAM | ~5GB VRAM | ~4GB VRAM |
| 13B params | ~14GB VRAM | ~9GB VRAM | ~7GB VRAM |
| 70B params | ~70GB VRAM | ~44GB VRAM | ~35GB VRAM |
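These figures follow from simple arithmetic: a weight quantized to n bits occupies n/8 bytes, so the weights alone need roughly parameters × bits / 8 bytes, with the KV cache and activations adding a few GB on top. A quick sanity check:

```bash
# Weights-only estimate for 70B at Q4: params × bits_per_weight / 8 bytes.
awk 'BEGIN { params = 70e9; bits = 4; printf "~%.0f GB\n", params * bits / 8 / 1e9 }'
# Prints ~35 GB, matching the 70B / Q4 row in the table above.
```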
Inference Speed Comparison
Tokens/second for Llama 3.1 8B (Q4) varies widely with the exact card, quantization, and context length, so rather than relying on generic figures, measure throughput on your own machines with `rbee worker benchmark` (see Best Practices below).
Advanced: Custom Routing Logic with Rhai
Use Rhai scripts to implement sophisticated routing:
```rhai
// ~/.config/rbee/routing.rhai
fn route_request(request, workers) {
    let model_size = request.model_size;

    // Large models: prefer NVIDIA with high VRAM
    if model_size > 40_000_000_000 {
        return workers.filter(|w| w.backend == "CUDA" && w.vram > 40);
    }

    // Medium models: prefer Metal or CUDA
    if model_size > 10_000_000_000 {
        return workers.filter(|w| w.backend in ["Metal", "CUDA"]);
    }

    // Small models: any GPU works
    return workers.filter(|w| w.backend != "CPU");
}
```

Troubleshooting Mixed Hardware: Common Issues
Backend Not Detected
```bash
# Check available backends
rbee worker info --hive HIVE_NAME

# Force a specific backend
rbee worker start --hive HIVE_NAME --backend cuda

# Install missing dependencies:
#   CUDA:  install the CUDA Toolkit + cuDNN
#   Metal: included with macOS (no install needed)
#   ROCm:  install ROCm drivers
```

Performance Issues
- Slow inference: Check the quantization level (Q4 is the fastest of the options above)
- Out of memory: Use a smaller model or a more aggressive quantization (e.g., Q4 instead of Q8)
- CPU fallback: Ensure GPU drivers are installed correctly
Best Practices: Maximize Hardware Utilization
Use `rbee worker benchmark` to measure actual performance on your hardware, then tune your routing rules around measured numbers instead of assumptions.
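A minimal sketch of that workflow, looping over the hives from the example setup above:

```bash
# Benchmark every hive, then compare tokens/sec across backends
# before writing capability-aware routing rules.
for hive in gaming-pc mac-studio amd-server cpu-server; do
  rbee worker benchmark --hive "$hive"
done
```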
Stop Vendor Lock-in: Mix and Match Your GPU Hardware with rbee
Heterogeneous hardware support is built into rbee core. No premium tier required. Use NVIDIA, Apple, AMD, and CPU workers together in one system.