GPU Selection for Local AI Agent Workloads
Running AI agents locally means running inference locally. Your GPU choice determines how fast your agent thinks, how large a model it can load, and whether the whole setup is practical for daily use. The hardware landscape shifted significantly in 2025, and the numbers are worth knowing before making a purchase decision.
What "Fast Enough" Means for Agents
First, the practical threshold. For an agent interaction to feel responsive rather than laggy, you need roughly 20 to 30 tokens per second of conversational output. Below that, you start to notice each token being generated. Below 10 tokens per second, complex tasks that produce long responses become frustratingly slow.
For agent tasks that run in the background - processing files, executing multi-step workflows, analyzing data - latency matters less than throughput: how much can the model process per minute?
Keep these thresholds in mind as you look at the benchmark numbers.
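If you want those cutoffs in code - say, for a benchmark script - a trivial sketch (the thresholds are the ones above, not universal constants):

```python
def classify_speed(tokens_per_second: float) -> str:
    """Map a measured generation rate onto the rough thresholds above."""
    if tokens_per_second >= 20:
        return "responsive: fine for interactive agent conversations"
    if tokens_per_second >= 10:
        return "usable: noticeable lag on long outputs"
    return "slow: acceptable only for background/batch tasks"

print(classify_speed(25))  # responsive
print(classify_speed(8))   # slow
```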
Apple Silicon - The Unified Memory Advantage
For macOS AI agents, Apple Silicon is the natural choice. The defining feature is unified memory architecture: the CPU, GPU, and Neural Engine share the same memory pool. There is no separate VRAM limit.
This matters enormously for model size. A Mac with 64GB unified memory can load a 40B parameter quantized model that would require $1,500+ of NVIDIA VRAM to run on a PC.
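A quick way to see why: weights-only memory for a quantized model is just parameter count times bits per weight. A back-of-the-envelope sketch (the ~4.5 bits/weight and 10 GB headroom figures are rough assumptions; real GGUF files add metadata, and the KV cache grows with context length):

```python
def quantized_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only memory for a quantized model, in GB."""
    return params_billion * bits_per_weight / 8

# 40B parameters at ~4.5 bits/weight (roughly Q4_K_M-class):
weights = quantized_weights_gb(40, 4.5)  # ~22.5 GB
headroom = 10                            # rough allowance: KV cache, runtime, OS
print(f"{weights:.1f} GB weights -> fits in 64 GB Mac: {weights + headroom < 64}")
```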
M4 Max (128GB unified memory)
- Llama 3.1 8B Q4: approximately 80-100 tokens/second
- Llama 3.1 70B Q4: approximately 12-18 tokens/second
- Power draw under inference load: 40-60W
M3 Ultra (192GB unified memory)
- Llama 3.1 8B Q4: approximately 76 tokens/second
- Llama 3.1 70B Q4: approximately 20-25 tokens/second
- Can run 100B+ parameter models that no single consumer GPU can fit
The M4 Pro (48GB max) is the practical minimum for running 13B+ models comfortably. The M4 Max is the sweet spot for daily use. The M-series Ultra chips are for users who need 70B model performance without a dedicated GPU server.
For agents that run specifically on macOS (like Fazm), Apple Silicon is not just convenient - it is where framework support is best tuned, with Metal acceleration and Core ML integration.
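A concrete illustration: llama.cpp's Python bindings (llama-cpp-python) use the Metal backend on Apple Silicon builds when you offload layers to the GPU. A minimal load might look like this sketch (the model path is a placeholder; `n_gpu_layers=-1` offloads every layer):

```python
from llama_cpp import Llama

# Hypothetical local path to a quantized GGUF model.
llm = Llama(
    model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers; uses Metal on Apple Silicon builds
    n_ctx=8192,       # context window; larger values grow the KV cache
)

out = llm("Summarize the files in ~/Documents in one paragraph.", max_tokens=256)
print(out["choices"][0]["text"])
```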
NVIDIA - Raw Throughput Leader
If you are running a Linux homelab or want maximum tokens-per-second on a specific model:
RTX 5090 (32GB GDDR7)
- Released in early 2025, currently the fastest consumer GPU for inference
- Llama 3.1 8B: 213 tokens/second
- Qwen2.5-Coder 7B: 5,841 tokens/second (optimized inference server)
- 32B parameter models: approximately 61 tokens/second
- Power draw: 575W TDP
The RTX 5090 delivers 72% better throughput than the RTX 4090 on NLP tasks, primarily due to higher GDDR7 memory bandwidth (1.8 TB/s vs 1.0 TB/s on the 4090).
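The bandwidth link is direct: single-stream decoding is memory-bound, so a rough ceiling on tokens/second is bandwidth divided by model size, since generating each token streams roughly the full weights once. A quick sanity check against the numbers above:

```python
def decode_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on single-stream tokens/sec for memory-bound decoding:
    each token requires reading roughly the full weights once."""
    return bandwidth_gb_s / model_size_gb

# Llama 3.1 8B at Q4_K_M is roughly 5 GB of weights.
print(decode_ceiling(1800, 5.0))  # RTX 5090: ~360 tok/s ceiling vs ~213 measured
print(decode_ceiling(1000, 5.0))  # RTX 4090: ~200 tok/s ceiling vs ~120-140 measured
```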
RTX 4090 (24GB GDDR6X) - Still the value leader for most workloads
- Llama 3.1 8B: approximately 120-140 tokens/second
- Fits 13B models at Q8 with room to spare, larger models with heavier quantization
- Available used for significantly less than the 5090 MSRP
Dual RTX 3090 / 4090 setups - 48GB combined VRAM, comparable speed to a single 5090 at lower cost. Adds complexity (tensor parallelism configuration, double power draw) but can be worth it for fitting 34B-40B models at higher precision.
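As a rough sketch of what that configuration looks like with llama-cpp-python (the model path and the 50/50 split are illustrative; vLLM expresses the same idea with its `tensor_parallel_size` setting):

```python
from llama_cpp import Llama

# Split weights roughly evenly across two 24GB cards.
llm = Llama(
    model_path="models/qwen2.5-32b-instruct.Q5_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,          # offload all layers to the GPUs
    tensor_split=[0.5, 0.5],  # fraction of the model assigned to each device
)
```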
The CUDA ecosystem remains the benchmark for software support. Most inference frameworks (llama.cpp, vLLM, TGI) are tested on NVIDIA first.
AMD - Improving but Still Catching Up
AMD's consumer GPUs offer competitive price-per-VRAM but lag in software ecosystem maturity.
RX 7900 XTX (24GB GDDR6)
- Comparable VRAM to the RTX 4090 at lower cost
- ROCm support has improved significantly in 2024-2025, but still requires more configuration than CUDA
- Inference performance is generally 15-30% behind equivalent NVIDIA hardware on most frameworks
- Best for technically confident users willing to debug occasional compatibility issues
RX 9070 XT (16GB GDDR6) - The 2025 mid-range option
- 16GB is the practical minimum for running 13B models without heavy quantization
- ROCm 6.x has notably better llama.cpp support than earlier versions
- Suitable if cost is the primary constraint and you are comfortable on Linux
AMD is not recommended for primary inference workloads if you want a low-maintenance setup. If you already have AMD hardware or are cost-constrained, it works, but expect to spend time on configuration.
Quantization: Where Memory and Speed Meet
Raw model size in parameters is not the right number to optimize against. Quantized models run faster and fit in less memory with minimal accuracy loss.
A practical reference for Llama 3.1 70B:
| Quantization | Model Size | Tokens/sec (RTX 4090) | Quality vs FP16 |
|---|---|---|---|
| FP16 | 140GB | Does not fit | Baseline |
| Q8 | 70GB | Does not fit | ~99% |
| Q5_K_M | 48GB | Does not fit | ~97% |
| Q4_K_M | 40GB | Does not fit | ~95% |
| Q3_K_M | 30GB | ~8 tokens/sec (partial CPU offload) | ~90% |
For the RTX 4090 with 24GB VRAM, even Q4_K_M at 70B does not fit. To run 70B at Q4_K_M you need an Apple Silicon Mac with 64GB+ unified memory or a dual-GPU setup; even the RTX 5090's 32GB only fits the Q3_K_M file.
For the 13B class of models that most desktop agents actually use:
| Quantization | Model Size | Fits in 24GB VRAM | Tokens/sec (RTX 4090) |
|---|---|---|---|
| FP16 | 26GB | No | - |
| Q8 | 13GB | Yes | ~90 |
| Q4_K_M | 8GB | Yes | ~140 |
Q4_K_M is the practical default for most agent workloads. You lose roughly 3-5% on standard benchmarks compared to FP16, but you run 1.5-2x faster and fit in substantially less memory.
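To turn those tables into a decision rule, here is a small sketch; the bits-per-weight values are approximate GGUF averages, and the headroom constant is an assumption:

```python
# Approximate bits per weight for common GGUF quantization levels.
BPW = {"FP16": 16.0, "Q8": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85, "Q3_K_M": 3.9}

def pick_quant(params_billion: float, vram_gb: float,
               headroom_gb: float = 3.0) -> str | None:
    """Return the highest-quality quant whose weights fit, leaving rough
    headroom for KV cache and runtime buffers (headroom is an assumption)."""
    for name, bpw in sorted(BPW.items(), key=lambda kv: -kv[1]):
        if params_billion * bpw / 8 + headroom_gb <= vram_gb:
            return name
    return None

print(pick_quant(13, 24))  # Q8: matches the 13B table above
print(pick_quant(70, 24))  # None: 70B exceeds a 24GB card at every level
```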
What Actually Matters for Agent Workloads
AI agents do not just run model inference - they also process screenshots, parse accessibility trees, call external APIs, and manage tool orchestration. The model inference is often the bottleneck, but the full picture matters:
Memory capacity over raw speed. Being able to load a 30B parameter model beats running an 8B model faster. The quality difference is significant for complex reasoning tasks.
First-token latency. For interactive agent conversations, how quickly the first token appears matters more than sustained throughput. Apple Silicon performs relatively well here due to unified memory bandwidth.
Sustained throughput for batch tasks. For background processing - analyzing a directory of files, batch summarization - sustained tokens/second matters more. NVIDIA has a clear edge here.
The 500ms threshold. For agent interactions to feel interactive rather than slow, the first token should arrive within about 500ms of the prompt completing. On a well-tuned M4 Max or RTX 4090 running an 8B model, this is easily achievable.
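Measuring this is straightforward with any streaming client. A sketch using llama-cpp-python, timing the first chunk and then the rest (each streamed chunk is roughly one token; the model path is a placeholder):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf", n_gpu_layers=-1)

start = time.perf_counter()
first_token_at = None
n_tokens = 0
for chunk in llm("List three uses for a local agent.", max_tokens=128, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()  # time to first token
    n_tokens += 1

ttft_ms = (first_token_at - start) * 1000
tok_s = n_tokens / (time.perf_counter() - first_token_at)
print(f"time to first token: {ttft_ms:.0f} ms, sustained: {tok_s:.1f} tok/s")
```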
Practical Recommendation
For macOS users running agents locally: the Mac you already have is a reasonable starting point. If you are buying new hardware for this purpose, the M4 Max with 64GB unified memory covers most use cases - 30B models at interactive speed, 70B models at background-task speed, no GPU maintenance.
For Linux homelab users who want maximum inference performance: the RTX 4090 (used market) or RTX 5090 (new) gives substantially higher throughput on 8B to 30B models. Budget 450-575W of power and a well-ventilated case.
Do not buy dedicated AI hardware before identifying your specific bottleneck. Most agent setups are bottlenecked on network (calling cloud APIs), tool execution (file system, browser automation), or prompt quality long before they are bottlenecked on local inference speed.
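A simple way to find that bottleneck is to time each stage of one agent step. The stage functions below are stand-ins for your own pipeline:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, totals: dict):
    """Accumulate wall-clock time per pipeline stage."""
    start = time.perf_counter()
    yield
    totals[label] = totals.get(label, 0.0) + time.perf_counter() - start

# Stand-ins for real agent stages; replace with your own calls.
def fetch_context():      time.sleep(0.30)  # e.g. cloud API / web requests
def run_tools():          time.sleep(0.15)  # e.g. file system, browser automation
def generate_response():  time.sleep(0.10)  # e.g. local model inference

totals: dict[str, float] = {}
with timed("network", totals):   fetch_context()
with timed("tools", totals):     run_tools()
with timed("inference", totals): generate_response()

for stage, secs in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {secs:.2f}s")
```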
Fazm is an open source macOS AI agent, available on GitHub.