sPipe: Hybrid GPU and CPU Pipeline for Training LLMs Under Memory Pressure

Matthew Diakonov · 12 min read


Training large language models on hardware with limited GPU memory is one of the most common bottlenecks researchers and engineers face in 2026. sPipe is an approach to hybrid pipeline parallelism that treats CPU memory and compute as first-class participants in the training loop rather than a last resort fallback.

Why GPU Memory Is the Bottleneck

A 7B parameter model in fp16 needs roughly 14GB of GPU memory for weights alone, a few hundred MB per transformer layer. Add optimizer states, activations, and gradients, and the training footprint grows to 4x to 6x that figure. A 70B model in fp16 needs over 140GB just for the parameters, well beyond what a single A100 (80GB) can hold.
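These numbers are easy to sanity-check with a back-of-envelope calculation. The helper below is a rough sketch (the 6x overhead multiple is an assumed mid-range value, not a measured constant):

```python
def training_footprint_gb(params_billions, bytes_per_param=2, overhead=6.0):
    """Back-of-envelope: weights in GB, plus an overhead multiple that
    approximates optimizer states, gradients, and activations."""
    weights_gb = params_billions * bytes_per_param
    return weights_gb, weights_gb * overhead

weights, total = training_footprint_gb(7)  # 7B model in fp16
print(weights, total)  # 14 GB of weights, ~84 GB with training state
```

The same function with `params_billions=70` gives 140GB for weights alone, matching the figure above.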

Standard approaches to this problem include:

  • Model parallelism (tensor or pipeline): splits the model across multiple GPUs, but requires expensive multi-GPU setups
  • Gradient checkpointing: trades compute for memory by recomputing activations during the backward pass, adding 30% to 40% overhead
  • ZeRO/FSDP: shards optimizer states across GPUs, still assumes you have multiple GPUs
  • CPU offloading (DeepSpeed ZeRO-Offload, ZeRO-Infinity): moves optimizer states and optionally parameters to CPU RAM

sPipe takes a different path. Instead of treating CPU offloading as a bolt-on optimization, it builds a pipeline schedule that explicitly assigns pipeline stages to either GPU or CPU, scheduling them to overlap compute and data transfer.

How sPipe Works

The core idea is straightforward: partition the model into pipeline stages, then assign each stage to either a GPU device or a CPU device based on available memory and compute capacity. The pipeline scheduler orchestrates microbatch execution so that CPU stages run in parallel with GPU stages, hiding latency behind overlap.
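The pseudocode later in this post passes stages around as simple records. A minimal `Stage` type (an assumption of this sketch, not part of any published sPipe API) could look like:

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    device: str  # e.g. "cuda:0" or "cpu"
    layers: list = field(default_factory=list)  # layer names in this stage

# A CPU stage holding the embedding and the first transformer layer
s = Stage(device="cpu", layers=["embed", "layer_0"])
print(s.device, len(s.layers))
```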

[Figure: sPipe hybrid pipeline architecture. Four stages alternate between GPU (Stage 0: layers 0-7; Stage 2: layers 16-23) and CPU (Stage 1: layers 8-15; Stage 3: layers 24-31). A pipeline scheduler interleaves microbatches, overlapping GPU compute with CPU compute and PCIe transfers. GPU: 24GB VRAM, fast compute, limited memory. CPU: 128GB RAM, slow compute, large memory. PCIe bus: ~32GB/s bidirectional.]

Stage Assignment Strategy

The scheduler assigns stages based on a cost model that considers three factors:

  1. Memory capacity: how many layers fit in GPU VRAM vs. CPU RAM
  2. Compute throughput: FLOPs per second on GPU vs. CPU for matrix multiplications
  3. Transfer bandwidth: PCIe throughput between host and device memory

The assignment algorithm greedily places layers on GPU until VRAM is exhausted, then spills remaining layers to CPU stages. The key insight is that CPU stages handling fewer FLOPs (like the final projection layers or embedding lookups) introduce less pipeline bubble time than placing compute-heavy attention layers on CPU.

Microbatch Scheduling

sPipe uses 1F1B (one forward, one backward) scheduling with a twist: microbatches that hit CPU stages get a head start to compensate for slower CPU compute. The scheduler sends the next microbatch to a CPU stage before the GPU stage finishes its current microbatch, effectively double-buffering the pipeline.
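A toy steady-state model shows why the head start matters. Assuming illustrative stage times (10ms GPU, 40ms CPU, 4ms transfer, as in the transfer example later in this post), perfect double buffering bounds the per-microbatch time by the slowest stage rather than the sum of all stages:

```python
# Illustrative stage times, in milliseconds (assumed values).
gpu_ms, cpu_ms, xfer_ms = 10.0, 40.0, 4.0

# No overlap: every microbatch pays every stage and transfer in turn.
sequential = gpu_ms + xfer_ms + cpu_ms + xfer_ms

# Perfect double buffering: the CPU stage starts microbatch i+1 while the
# GPU runs microbatch i, so steady-state time is set by the slowest stage.
overlapped = max(gpu_ms, cpu_ms, xfer_ms)

print(sequential, overlapped)  # 58.0 40.0
```

Real pipelines fall between the two bounds, which is why profiling the bubble (discussed under Common Pitfalls) matters.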

Comparing Approaches to Memory-Constrained Training

| Approach | GPU Memory Needed | Extra Hardware | Throughput Impact | Implementation Complexity |
|---|---|---|---|---|
| Full model on GPU | 100% of model | None | Baseline | Low |
| Gradient checkpointing | ~60% of model | None | -30% to -40% | Low |
| DeepSpeed ZeRO-Offload | ~40% of model | CPU RAM | -50% to -70% | Medium |
| FSDP (multi-GPU) | Splits across N GPUs | N GPUs | Near-linear scaling | Medium |
| sPipe (hybrid pipeline) | Configurable per stage | CPU RAM + cores | -20% to -50% (varies) | High |
| Full CPU training | 0% (CPU only) | Large RAM | -95%+ | Low |

The sweet spot for sPipe is when you have one or two GPUs with limited VRAM (24GB to 48GB) but generous CPU RAM (128GB+) and a modern CPU with AVX-512 or AMX support for matrix operations.

The Transfer Bottleneck

The biggest challenge in any hybrid GPU/CPU pipeline is the PCIe bus. PCIe 4.0 x16 delivers roughly 32GB/s in each direction. For a 7B model with 4 CPU-assigned layers, each forward pass through those layers produces activation tensors that need to travel back to the GPU.

For a microbatch of 8 sequences at 2048 tokens with a hidden dimension of 4096 in fp16, one activation tensor is:

8 × 2048 × 4096 × 2 bytes = 128MB per layer boundary

At 32GB/s, that transfer takes about 4ms. If the GPU stage takes 10ms per microbatch and the CPU stage takes 40ms, the 4ms transfer is a small fraction of the overall time. But stack up several boundaries and the transfers start to matter.
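The arithmetic above generalizes into a small helper (illustrative parameter names; `pcie_gib_s=32` models a PCIe 4.0 x16 link):

```python
def transfer_ms(batch, seq_len, hidden, bytes_per_el=2, pcie_gib_s=32):
    """Time to move one activation tensor across PCIe, in milliseconds."""
    size_bytes = batch * seq_len * hidden * bytes_per_el
    return size_bytes / (pcie_gib_s * 1024**3) * 1000

# The 128 MiB activation from the text crosses the link in ~3.9 ms
print(transfer_ms(8, 2048, 4096))
```

Doubling the microbatch size or halving the bandwidth (PCIe 3.0) both double this figure, which is where the warning below comes from.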

Warning

PCIe 3.0 systems cut the bandwidth in half (~16GB/s). If you are on older hardware, the transfer overhead can dominate and make CPU offloading slower than gradient checkpointing alone. Always benchmark on your actual hardware before committing to a hybrid pipeline.

Practical Implementation Patterns

Pattern 1: Embedding and Head on CPU, Attention on GPU

The embedding layer and language model head together account for a large memory footprint (vocabulary size times hidden dimension: roughly 500MB for a 32K vocabulary at fp16 and hidden dimension 4096, and 2GB+ for the 100K+ vocabularies common in recent models) but relatively few FLOPs compared to the attention and FFN layers. Moving them to CPU frees significant VRAM for the compute-heavy layers.

# Pseudocode for stage assignment
stages = []
# CPU stage: embedding + first 2 transformer layers
stages.append(Stage(device="cpu", layers=["embed", "layer_0", "layer_1"]))
# GPU stage: core attention layers
stages.append(Stage(device="cuda:0", layers=[f"layer_{i}" for i in range(2, 28)]))
# CPU stage: final layers + LM head
stages.append(Stage(device="cpu", layers=["layer_28", "layer_29", "lm_head"]))
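The footprint freed by this pattern is easy to estimate. A quick sketch (the 128,256-entry vocabulary is an assumed example, roughly Llama 3's; an untied head doubles the matrix):

```python
def embed_plus_head_mib(vocab, hidden, bytes_per_el=2):
    """fp16 footprint of the embedding matrix plus an untied LM head, in MiB."""
    return 2 * vocab * hidden * bytes_per_el / 1024**2

print(embed_plus_head_mib(32_000, 4096))   # 500.0 MiB
print(embed_plus_head_mib(128_256, 4096))  # 2004.0 MiB (~2 GiB)
```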

Pattern 2: Alternating GPU/CPU for Large Models

When the model is too large for even most layers to fit on GPU, alternate between GPU and CPU stages. Each GPU stage processes a few layers, transfers to CPU for the next chunk, then back to GPU. This maximizes GPU utilization at the cost of more PCIe transfers.

# For a 70B model on a single 24GB GPU
# Each GPU stage holds 4 layers (fits in VRAM with activations)
stages = []
for i in range(0, 80, 4):
    if i % 8 < 4:
        stages.append(Stage(device="cuda:0", layers=[f"layer_{i+j}" for j in range(4)]))
    else:
        stages.append(Stage(device="cpu", layers=[f"layer_{i+j}" for j in range(4)]))

Pattern 3: Optimizer States on CPU, Forward/Backward on GPU

A lighter variant that keeps all forward and backward computation on GPU but offloads optimizer states (which are 2x the model size for Adam) to CPU. The optimizer step happens on CPU after gradients are computed on GPU. This is essentially what ZeRO-Offload does, but sPipe generalizes it within the pipeline framework.
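A minimal sketch of this pattern in plain PyTorch (assumed structure for illustration, not the sPipe or ZeRO-Offload API; a tiny `Linear` stands in for the model):

```python
import torch

# Forward/backward on the accelerator; the Adam state lives on CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 16).to(device)

# CPU mirror of the parameters; the optimizer (and its states) stays here.
cpu_params = [p.detach().cpu().clone().requires_grad_(True)
              for p in model.parameters()]
opt = torch.optim.Adam(cpu_params, lr=1e-3)

# One training step: gradients device -> CPU, step on CPU, weights back.
loss = model(torch.randn(4, 16, device=device)).sum()
loss.backward()
for cpu_p, dev_p in zip(cpu_params, model.parameters()):
    cpu_p.grad = dev_p.grad.detach().cpu()
opt.step()
with torch.no_grad():
    for cpu_p, dev_p in zip(cpu_params, model.parameters()):
        dev_p.copy_(cpu_p.to(device))
```

In a real run the gradient copy and the CPU step would overlap with the next microbatch's forward pass, which is exactly what the pipeline scheduler arranges.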

Benchmarks and Expected Performance

Real-world numbers vary significantly based on hardware configuration. Here are approximate throughput figures for training a 13B parameter model with different setups:

| Configuration | Hardware | Tokens/sec | Memory Usage (GPU) |
|---|---|---|---|
| Full GPU | 2x A100 80GB | ~4,200 | 76GB per GPU |
| Gradient checkpoint | 1x A100 80GB | ~2,800 | 68GB |
| sPipe (embed+head on CPU) | 1x A100 80GB + 256GB RAM | ~3,500 | 62GB |
| sPipe (50% CPU stages) | 1x RTX 4090 24GB + 256GB RAM | ~900 | 22GB |
| ZeRO-Offload | 1x RTX 4090 24GB + 256GB RAM | ~700 | 20GB |

The 1x A100 sPipe configuration is notable: by offloading just the embedding and head to CPU, you recover enough VRAM to increase the microbatch size, which more than compensates for the offloading overhead. On the RTX 4090, sPipe outperforms ZeRO-Offload by roughly 25% because the pipeline schedule overlaps transfers with compute more effectively.

Common Pitfalls

  • Placing attention layers on CPU: Attention is the most compute-intensive operation per parameter. Even a single attention layer on CPU can become a pipeline bottleneck that starves the GPU of work. Prioritize keeping attention and FFN layers on GPU; offload embeddings, norms, and projection heads first.

  • Ignoring NUMA topology: On multi-socket systems, CPU memory access is not uniform. If your CPU stage runs on socket 0 but the data lands in socket 1's memory, you pay an extra 40% to 60% latency penalty. Pin CPU stages to the socket closest to the GPU's PCIe root complex.

  • Too many pipeline stages: Each stage boundary adds a transfer. Four stages with clean boundaries is better than eight stages with constant PCIe chatter. Merge adjacent CPU stages into one when possible.

  • Not profiling the pipeline bubble: The bubble (idle time) in pipeline parallelism can consume 20% to 40% of total time. Use PyTorch's profiler or nsight to visualize the pipeline and identify stages that are consistently slower than others. Rebalance by moving one or two layers between stages.

  • Forgetting mixed precision: CPU stages should use bf16 or fp32, not fp16. Most CPUs lack native fp16 arithmetic and will upconvert to fp32 internally, so fp16 on CPU buys little beyond conversion overhead. GPU stages should use fp16 or bf16 with loss scaling as usual.
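The mixed-precision rule from the last bullet can be captured in a small helper (an illustrative sketch, not part of any library API):

```python
import torch

def stage_dtype(device: str) -> torch.dtype:
    """bf16 on CPU stages (no fast native fp16 kernels on most CPUs);
    fp16 on GPU stages, with the usual loss scaling applied elsewhere."""
    return torch.bfloat16 if device.startswith("cpu") else torch.float16

print(stage_dtype("cpu"), stage_dtype("cuda:0"))
```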

Minimal Working Configuration

If you want to experiment with hybrid pipeline training, here is a starting point using PyTorch's torch.distributed.pipelining module (available in recent PyTorch releases; the exact API surface varies by version):

import torch
from torch.distributed.pipelining import SplitPoint, pipeline

# Define split points for a GPT-style model
# Layers 0-15 on GPU, layers 16-31 on CPU
split_spec = {
    "transformer.layers.16": SplitPoint.BEGINNING,
}

# Create the pipeline with device assignment
pipe = pipeline(
    model,
    mb_args=(microbatch_input,),
    split_spec=split_spec,
    # Map stage 0 -> cuda:0, stage 1 -> cpu
    # (device mapping shown for illustration; see the note below)
    device_map={0: "cuda:0", 1: "cpu"},
)

# Run one forward/backward step
output = pipe(input_batch)
loss = loss_fn(output, labels)
loss.backward()
optimizer.step()

Note

PyTorch's native pipeline API is evolving rapidly. The device_map parameter for CPU stages may require additional configuration depending on your PyTorch version. Check the PyTorch distributed docs for the latest API surface.

When sPipe Makes Sense (and When It Does Not)

Use sPipe when:

  • You have one GPU with 24GB VRAM and 128GB+ system RAM
  • Your CPU supports AVX-512 or AMX (Intel Sapphire Rapids, AMD Zen 4+)
  • You are training (not just running inference) and need all the VRAM you can get for activations
  • You have PCIe 4.0 or 5.0 with x16 lanes between GPU and CPU

Skip sPipe when:

  • You have a multi-GPU setup with sufficient total VRAM (use FSDP or tensor parallelism instead)
  • Your workload is inference-only (use quantization or KV cache offloading instead)
  • You are on PCIe 3.0 or a laptop with shared memory (the transfer overhead will dominate)

Wrapping Up

sPipe represents a pragmatic middle ground for training LLMs on memory-constrained hardware. By treating CPU stages as peers in the pipeline rather than a dumping ground for optimizer states, it achieves better GPU utilization than pure offloading approaches. The key is careful stage assignment: keep compute-heavy layers on GPU, offload memory-heavy but compute-light layers to CPU, and let the pipeline scheduler handle the overlap. For researchers with a single powerful GPU and abundant system RAM, this approach can be the difference between training a 13B model locally and needing to rent cloud GPUs.

