Heterogeneous Hardware: Run NVIDIA, Apple, and AMD GPUs Together
Stop being locked into one GPU vendor. rbee orchestrates NVIDIA CUDA, Apple Metal, and AMD ROCm hardware in a single system, so you get maximum utilization out of the machines you already own.
The Reality of Mixed Hardware: Stop Vendor Lock-in
Most developers don't have a uniform GPU fleet. You might have an NVIDIA RTX 4090 in your gaming PC, a Mac Studio with an M2 Ultra, an old server with AMD GPUs, and perhaps some cloud instances with Tesla cards. Each uses a different compute framework, and traditional orchestration platforms force you to pick one vendor. rbee lets you orchestrate them all in a single system, maximizing your existing hardware investment instead of forcing you to standardize.
How rbee Handles Heterogeneous Hardware: Three Key Features
1. Automatic Backend Detection
When a worker starts, rbee automatically detects available compute backends:
```
# Worker startup log
[INFO] Detecting compute backends...
[INFO] ✓ CUDA 12.1 detected (NVIDIA RTX 4090, 24GB VRAM)
[INFO] ✓ cuBLAS, cuDNN available
[INFO] Worker registered as: worker-gaming-pc (CUDA)
```

2. Model Format Compatibility
rbee uses GGUF format (from llama.cpp), which works across all backends:
GGUF Format Support
| Backend | Framework | GGUF Support |
|---|---|---|
| NVIDIA CUDA | llama.cpp + CUDA | ✓ |
| Apple Metal | llama.cpp + Metal | ✓ |
| AMD ROCm | llama.cpp + ROCm | ✓ |
| CPU | llama.cpp (AVX2/AVX512) | ✓ |
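Because every backend consumes the same GGUF artifact, one download works everywhere. A minimal sketch using the `rbee model download` command from the walkthrough below (the hive names come from that example):

```bash
# The same GGUF file serves CUDA, Metal, and ROCm workers alike;
# each worker loads it with its locally detected llama.cpp backend.
rbee model download llama-3.1-8b-q4 --hive gaming-pc   # NVIDIA CUDA
rbee model download llama-3.1-8b-q4 --hive mac-studio  # Apple Metal
rbee model download llama-3.1-8b-q4 --hive amd-server  # AMD ROCm
```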
3. Smart Routing Based on Hardware Capabilities
The keeper routes requests based on worker capabilities:
```toml
# Example: Route 70B model to high-VRAM workers
[routing]
strategy = "capability-aware"

[[routing.rules]]
model_size = ">40B"
prefer_backends = ["CUDA"]  # Prefer NVIDIA for large models
min_vram = 40               # Require 40GB+ VRAM

[[routing.rules]]
model_size = "<10B"
prefer_backends = ["Metal", "CUDA", "ROCm"]  # Any GPU works
min_vram = 8
```

Real-World Example: Mixed Hardware Setup
Let's set up a system with different GPU types:
Machine 1: Gaming PC (NVIDIA RTX 4090)
```bash
# Add gaming PC with NVIDIA GPU
rbee hive add gaming-pc 192.168.1.100

# Deploy worker (auto-detects CUDA)
rbee worker deploy --hive gaming-pc

# Download large models (per the quantization table below, 70B at Q4
# needs ~35GB, so a 24GB card relies on llama.cpp's partial CPU offload)
rbee model download llama-3.1-70b-q4 --hive gaming-pc
```

Machine 2: Mac Studio (Apple M2 Ultra)
```bash
# Add Mac Studio
rbee hive add mac-studio 192.168.1.101

# Deploy worker (auto-detects Metal)
rbee worker deploy --hive mac-studio

# Download medium models (M2 Ultra has 128GB unified memory)
rbee model download llama-3.1-8b --hive mac-studio
rbee model download mistral-7b --hive mac-studio
```

Machine 3: Old Server (AMD Radeon)
```bash
# Add AMD server
rbee hive add amd-server 192.168.1.102

# Deploy worker (auto-detects ROCm)
rbee worker deploy --hive amd-server

# Download smaller models
rbee model download llama-3.1-8b --hive amd-server
```

Machine 4: CPU-Only Fallback
```bash
# Add CPU-only machine
rbee hive add cpu-server 192.168.1.103

# Deploy worker (uses CPU backend)
rbee worker deploy --hive cpu-server

# Download quantized models for CPU inference
rbee model download llama-3.1-8b-q4 --hive cpu-server
```

Performance Considerations: Optimize for Each Backend
VRAM vs Unified Memory
Different architectures have different memory characteristics:
- NVIDIA CUDA: dedicated VRAM (fast, but limited in size)
- Apple Metal: unified memory (large capacity, shared with the CPU)
- AMD ROCm: dedicated VRAM (varies by card)
- CPU: system RAM (large, but slower)
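This matters for placement: a model that overflows any single card's dedicated VRAM can still fit in unified memory. A hedged sketch (the q5 model identifier follows this guide's naming pattern but is an assumption, not a confirmed catalog entry):

```bash
# 70B at Q5 needs ~44GB (see the table below): too big for a 24GB RTX 4090,
# but comfortable in the M2 Ultra's 128GB of unified memory.
rbee model download llama-3.1-70b-q5 --hive mac-studio
```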
Quantization Strategies
Choose quantization based on available VRAM:
VRAM Requirements by Quantization
| Model Size | Q8 (8-bit) | Q5 (5-bit) | Q4 (4-bit, recommended) |
|---|---|---|---|
| 7B params | ~8GB VRAM | ~5GB VRAM | ~4GB VRAM |
| 13B params | ~14GB VRAM | ~9GB VRAM | ~7GB VRAM |
| 70B params | ~70GB VRAM | ~44GB VRAM | ~35GB VRAM |
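These figures follow from simple arithmetic: a weight quantized to n bits occupies n/8 bytes, so the weights alone need roughly parameters × bits / 8 bytes, with the KV cache and activations adding a few GB on top. A quick sanity check:

```bash
# Weights-only estimate for 70B at Q4: params × bits_per_weight / 8 bytes.
awk 'BEGIN { params = 70e9; bits = 4; printf "~%.0f GB\n", params * bits / 8 / 1e9 }'
# Prints ~35 GB, matching the 70B / Q4 row in the table above.
```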
Inference Speed Comparison
Tokens/second for Llama 3.1 8B (Q4) varies widely with the exact card, quantization, and context length, so rather than relying on generic figures, measure throughput on your own machines with `rbee worker benchmark` (see Best Practices below).
Advanced: Custom Routing Logic with Rhai
Use Rhai scripts to implement sophisticated routing:
```rhai
// ~/.config/rbee/routing.rhai
fn route_request(request, workers) {
    let model_size = request.model_size;

    // Large models: prefer NVIDIA with high VRAM
    if model_size > 40_000_000_000 {
        return workers.filter(|w| w.backend == "CUDA" && w.vram > 40);
    }

    // Medium models: prefer Metal or CUDA
    if model_size > 10_000_000_000 {
        return workers.filter(|w| w.backend in ["Metal", "CUDA"]);
    }

    // Small models: any GPU works
    return workers.filter(|w| w.backend != "CPU");
}
```

Troubleshooting Mixed Hardware: Common Issues
Backend Not Detected
```bash
# Check available backends
rbee worker info --hive HIVE_NAME

# Force a specific backend
rbee worker start --hive HIVE_NAME --backend cuda

# Install missing dependencies:
#   CUDA:  install the CUDA Toolkit + cuDNN
#   Metal: included with macOS (no install needed)
#   ROCm:  install ROCm drivers
```

Performance Issues
- Slow inference: Check the quantization level (Q4 is the fastest of the options above)
- Out of memory: Use a smaller model or a more aggressive quantization (e.g., Q4 instead of Q8)
- CPU fallback: Ensure GPU drivers are installed correctly
Best Practices: Maximize Hardware Utilization
Use `rbee worker benchmark` to measure actual performance on your hardware, then tune your routing rules around measured numbers instead of assumptions.
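A minimal sketch of that workflow, looping over the hives from the example setup above:

```bash
# Benchmark every hive, then compare tokens/sec across backends
# before writing capability-aware routing rules.
for hive in gaming-pc mac-studio amd-server cpu-server; do
  rbee worker benchmark --hive "$hive"
done
```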
Stop Vendor Lock-in: Mix and Match Your GPU Hardware with rbee
Heterogeneous hardware support is built into rbee core. No premium tier required. Use NVIDIA, Apple, AMD, and CPU workers together in one system.