GPU Selection for Local AI Agent Workloads
Running AI agents locally means running inference locally. Your GPU choice determines how fast your agent thinks, how large a model it can load, and whether the whole setup is practical for daily use. The hardware landscape shifted significantly in 2025, and the numbers are worth knowing before making a purchase decision.
What "Fast Enough" Means for Agents
First, the practical threshold. For an agent interaction to feel responsive rather than laggy, you need roughly 20 to 30 tokens per second of conversational output. Below that, you start to notice each token being generated. Below 10 tokens per second, complex tasks that produce long responses become frustratingly slow.
For agent tasks that run in the background - processing files, executing multi-step workflows, analyzing data - latency matters less than throughput: how much can the model process per minute?
Keep these thresholds in mind as you look at the benchmark numbers.
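If you want those cutoffs in code - say, for a benchmark script - a trivial sketch (the thresholds are the ones above, not universal constants):

```python
def classify_speed(tokens_per_second: float) -> str:
    """Map a measured generation rate onto the rough thresholds above."""
    if tokens_per_second >= 20:
        return "responsive: fine for interactive agent conversations"
    if tokens_per_second >= 10:
        return "usable: noticeable lag on long outputs"
    return "slow: acceptable only for background/batch tasks"

print(classify_speed(25))  # responsive
print(classify_speed(8))   # slow
```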
Apple Silicon - The Unified Memory Advantage
For macOS AI agents, Apple Silicon is the natural choice. The defining feature is unified memory architecture: the CPU, GPU, and Neural Engine share the same memory pool. There is no separate VRAM limit.
This matters enormously for model size. A Mac with 64GB unified memory can load a 40B parameter quantized model that would require $1,500+ of NVIDIA VRAM to run on a PC.
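A quick way to see why: weights-only memory for a quantized model is just parameter count times bits per weight. A back-of-the-envelope sketch (the ~4.5 bits/weight and 10 GB headroom figures are rough assumptions; real GGUF files add metadata, and the KV cache grows with context length):

```python
def quantized_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only memory for a quantized model, in GB."""
    return params_billion * bits_per_weight / 8

# 40B parameters at ~4.5 bits/weight (roughly Q4_K_M-class):
weights = quantized_weights_gb(40, 4.5)  # ~22.5 GB
headroom = 10                            # rough allowance: KV cache, runtime, OS
print(f"{weights:.1f} GB weights -> fits in 64 GB Mac: {weights + headroom < 64}")
```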
M4 Max (128GB unified memory)
- Llama 3.1 8B Q4: approximately 80-100 tokens/second
- Llama 3.1 70B Q4: approximately 12-18 tokens/second
- Power draw under inference load: 40-60W
M3 Ultra (192GB unified memory)
- Llama 3.1 8B Q4: approximately 76 tokens/second
- Llama 3.1 70B Q4: approximately 20-25 tokens/second
- Can run 100B+ parameter models that no single consumer GPU can fit
The M4 Pro (48GB max) is the practical minimum for running 13B+ models comfortably. The M4 Max is the sweet spot for daily use. The M-series Ultra chips are for users who need 70B model performance without a dedicated GPU server.
For agents that run specifically on macOS (like Fazm), Apple Silicon is not just convenient - it is where framework support is best tuned, with Metal acceleration and Core ML integration.
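A concrete illustration: llama.cpp's Python bindings (llama-cpp-python) use the Metal backend on Apple Silicon builds when you offload layers to the GPU. A minimal load might look like this sketch (the model path is a placeholder; `n_gpu_layers=-1` offloads every layer):

```python
from llama_cpp import Llama

# Hypothetical local path to a quantized GGUF model.
llm = Llama(
    model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers; uses Metal on Apple Silicon builds
    n_ctx=8192,       # context window; larger values grow the KV cache
)

out = llm("Summarize the files in ~/Documents in one paragraph.", max_tokens=256)
print(out["choices"][0]["text"])
```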
NVIDIA - Raw Throughput Leader
If you are running a Linux homelab or want maximum tokens-per-second on a specific model:
RTX 5090 (32GB GDDR7)
- Released in early 2025, currently the fastest consumer GPU for inference
- Llama 3.1 8B: 213 tokens/second
- Qwen2.5-Coder 7B: 5,841 tokens/second (optimized inference server)
- 32B parameter models: approximately 61 tokens/second
- Power draw: 575W TDP
The RTX 5090 delivers 72% better throughput than the RTX 4090 on NLP tasks, primarily due to higher GDDR7 memory bandwidth (1.8 TB/s vs 1.0 TB/s on the 4090).
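The bandwidth link is direct: single-stream decoding is memory-bound, so a rough ceiling on tokens/second is bandwidth divided by model size, since generating each token streams roughly the full weights once. A quick sanity check against the numbers above:

```python
def decode_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on single-stream tokens/sec for memory-bound decoding:
    each token requires reading roughly the full weights once."""
    return bandwidth_gb_s / model_size_gb

# Llama 3.1 8B at Q4_K_M is roughly 5 GB of weights.
print(decode_ceiling(1800, 5.0))  # RTX 5090: ~360 tok/s ceiling vs ~213 measured
print(decode_ceiling(1000, 5.0))  # RTX 4090: ~200 tok/s ceiling vs ~120-140 measured
```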
RTX 4090 (24GB GDDR6X) - Still the value leader for most workloads
- Llama 3.1 8B: approximately 120-140 tokens/second
- Fits 13B models at Q8 with room to spare, larger models with heavier quantization
- Available used for significantly less than the 5090 MSRP
Dual RTX 3090 / 4090 setups - 48GB combined VRAM, comparable speed to a single 5090 at lower cost. Adds complexity (tensor parallelism configuration, double power draw) but can be worth it for fitting 34B-40B models at higher precision.
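As a rough sketch of what that configuration looks like with llama-cpp-python (the model path and the 50/50 split are illustrative; vLLM expresses the same idea with its `tensor_parallel_size` setting):

```python
from llama_cpp import Llama

# Split weights roughly evenly across two 24GB cards.
llm = Llama(
    model_path="models/qwen2.5-32b-instruct.Q5_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,          # offload all layers to the GPUs
    tensor_split=[0.5, 0.5],  # fraction of the model assigned to each device
)
```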
The CUDA ecosystem remains the benchmark for software support. Most inference frameworks (llama.cpp, vLLM, TGI) are tested on NVIDIA first.
AMD - Improving but Still Catching Up
AMD's consumer GPUs offer competitive price-per-VRAM but lag in software ecosystem maturity.
RX 7900 XTX (24GB GDDR6)
- Comparable VRAM to the RTX 4090 at lower cost
- ROCm support has improved significantly in 2024-2025, but still requires more configuration than CUDA
- Inference performance is generally 15-30% behind equivalent NVIDIA hardware on most frameworks
- Best for technically confident users willing to debug occasional compatibility issues
RX 9070 XT (16GB GDDR6) - The 2025 mid-range option
- 16GB is the practical minimum for running 13B models without heavy quantization
- ROCm 6.x has notably better llama.cpp support than earlier versions
- Suitable if cost is the primary constraint and you are comfortable on Linux
AMD is not recommended for primary inference workloads if you want a low-maintenance setup. If you already have AMD hardware or are cost-constrained, it works, but expect to spend time on configuration.
Quantization: Where Memory and Speed Meet
Raw model size in parameters is not the right number to optimize against. Quantized models run faster and fit in less memory with minimal accuracy loss.
A practical reference for Llama 3.1 70B:
| Quantization | Model Size | Tokens/sec (RTX 4090) | Quality vs FP16 |
|---|---|---|---|
| FP16 | 140GB | Does not fit | Baseline |
| Q8 | 70GB | Does not fit | ~99% |
| Q5_K_M | 48GB | Does not fit | ~97% |
| Q4_K_M | 40GB | Does not fit | ~95% |
| Q3_K_M | 30GB | ~8 tokens/sec (partial CPU offload) | ~90% |
For the RTX 4090 with 24GB VRAM, even Q4_K_M at 70B does not fit. To run 70B at Q4_K_M you need an Apple Silicon Mac with 64GB+ unified memory or a dual-GPU setup; even the RTX 5090's 32GB only fits the Q3_K_M file.
For the 13B class of models that most desktop agents actually use:
| Quantization | Model Size | Fits in 24GB VRAM | Tokens/sec (RTX 4090) |
|---|---|---|---|
| FP16 | 26GB | No | - |
| Q8 | 13GB | Yes | ~90 |
| Q4_K_M | 8GB | Yes | ~140 |
Q4_K_M is the practical default for most agent workloads. You lose roughly 3-5% on standard benchmarks compared to FP16, but you run 1.5-2x faster and fit in substantially less memory.
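To turn those tables into a decision rule, here is a small sketch; the bits-per-weight values are approximate GGUF averages, and the headroom constant is an assumption:

```python
# Approximate bits per weight for common GGUF quantization levels.
BPW = {"FP16": 16.0, "Q8": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85, "Q3_K_M": 3.9}

def pick_quant(params_billion: float, vram_gb: float,
               headroom_gb: float = 3.0) -> str | None:
    """Return the highest-quality quant whose weights fit, leaving rough
    headroom for KV cache and runtime buffers (headroom is an assumption)."""
    for name, bpw in sorted(BPW.items(), key=lambda kv: -kv[1]):
        if params_billion * bpw / 8 + headroom_gb <= vram_gb:
            return name
    return None

print(pick_quant(13, 24))  # Q8: matches the 13B table above
print(pick_quant(70, 24))  # None: 70B exceeds a 24GB card at every level
```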
What Actually Matters for Agent Workloads
AI agents do not just run model inference - they also process screenshots, parse accessibility trees, call external APIs, and manage tool orchestration. The model inference is often the bottleneck, but the full picture matters:
Memory capacity over raw speed. Being able to load a 30B parameter model beats running an 8B model faster. The quality difference is significant for complex reasoning tasks.
First-token latency. For interactive agent conversations, how quickly the first token appears matters more than sustained throughput. Apple Silicon performs relatively well here due to unified memory bandwidth.
Sustained throughput for batch tasks. For background processing - analyzing a directory of files, batch summarization - sustained tokens/second matters more. NVIDIA has a clear edge here.
The 500ms threshold. For agent interactions to feel interactive rather than slow, the first token should arrive within about 500ms of the prompt completing. On a well-tuned M4 Max or RTX 4090 running an 8B model, this is easily achievable.
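Measuring this is straightforward with any streaming client. A sketch using llama-cpp-python, timing the first chunk and then the rest (each streamed chunk is roughly one token; the model path is a placeholder):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf", n_gpu_layers=-1)

start = time.perf_counter()
first_token_at = None
n_tokens = 0
for chunk in llm("List three uses for a local agent.", max_tokens=128, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()  # time to first token
    n_tokens += 1

ttft_ms = (first_token_at - start) * 1000
tok_s = n_tokens / (time.perf_counter() - first_token_at)
print(f"time to first token: {ttft_ms:.0f} ms, sustained: {tok_s:.1f} tok/s")
```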
Practical Recommendation
For macOS users running agents locally: the Mac you already have is a reasonable starting point. If you are buying new hardware for this purpose, the M4 Max with 64GB unified memory covers most use cases - 30B models at interactive speed, 70B models at background-task speed, no GPU maintenance.
For Linux homelab users who want maximum inference performance: the RTX 4090 (used market) or RTX 5090 (new) gives substantially higher throughput on 8B to 30B models. Budget 450-575W of power and a well-ventilated case.
Do not buy dedicated AI hardware before identifying your specific bottleneck. Most agent setups are bottlenecked on network (calling cloud APIs), tool execution (file system, browser automation), or prompt quality long before they are bottlenecked on local inference speed.
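A simple way to find that bottleneck is to time each stage of one agent step. The stage functions below are stand-ins for your own pipeline:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, totals: dict):
    """Accumulate wall-clock time per pipeline stage."""
    start = time.perf_counter()
    yield
    totals[label] = totals.get(label, 0.0) + time.perf_counter() - start

# Stand-ins for real agent stages; replace with your own calls.
def fetch_context():      time.sleep(0.30)  # e.g. cloud API / web requests
def run_tools():          time.sleep(0.15)  # e.g. file system, browser automation
def generate_response():  time.sleep(0.10)  # e.g. local model inference

totals: dict[str, float] = {}
with timed("network", totals):   fetch_context()
with timed("tools", totals):     run_tools()
with timed("inference", totals): generate_response()

for stage, secs in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {secs:.2f}s")
```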
Fazm is an open source macOS AI agent, available on GitHub.