Opening (Pain Point & Analogy)

Ever hit this awkward stage where you move inference to a GPU, the GPU is smoking busy, the CPU is just standing by, and each run still takes 250-plus milliseconds? Think of it like a skewer shop: the head chef (GPU) is firing on all burners, while the prep cook (CPU) waits for a dish to come out before washing the next batch. What we want is for the prep cook to work while waiting, turning the kitchen into a true pipeline.

Principle Sketch

  • GPU dispatching is asynchronous: PyTorch keeps a CUDA task queue under the hood. model(dummy) merely pushes the “recipe” into the queue. By the time the GPU actually starts cooking, the CPU already has control back. The call returning only means the GPU accepted the job, not that it finished (a quick probe right after this list shows it).
  • Sync points slam the brakes: calls like tensor.cpu().numpy() or torch.cuda.synchronize() make PyTorch wait until the GPU is completely done, like stopping the prep cook to watch the stove.
  • Overlap is about decoupling: if CPU post-processing can live off the previous batch, it doesn’t need to block on the current GPU output. It’s a “prepare early, consume later” tactic.
  • Double buffering is the easiest win: keep a deque of length two. Let the GPU drop fresh tensors into the queue while the CPU consumes the oldest one, keeping both chefs busy.
  • Align the tail: there’s always a final dish left in the pipeline. Call torch.cuda.synchronize() at the end so timing doesn’t leave soup in the pot.
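
Want to see the asynchrony for yourself? Here's a minimal probe (a sketch; it reuses the same vit_b_16 and dummy batch as the scripts below, and the exact numbers depend on your GPU). On most setups the enqueue time is a small fraction of the wait time, and that gap is exactly the window the prep cook gets to use:

import time
import torch
import torchvision.models as models

model = models.vit_b_16(weights=None).to("cuda").eval()
dummy = torch.randn(10, 3, 224, 224, device="cuda")

with torch.inference_mode():
    _ = model(dummy)              # Warm up, then drain the queue
    torch.cuda.synchronize()

    t0 = time.perf_counter()
    out = model(dummy)            # Only enqueues the work and returns
    t1 = time.perf_counter()
    torch.cuda.synchronize()      # Now actually wait for the GPU to finish
    t2 = time.perf_counter()

print(f"enqueue: {(t1 - t0) * 1e3:.2f} ms, wait: {(t2 - t1) * 1e3:.2f} ms")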

Look at the two timelines side by side:

Serial:   GPU[batch0 inference]---sync---CPU[batch0 post]---GPU[batch1 inference]---sync---CPU[batch1 post]
Overlap:  GPU: [batch0 inference][batch1 inference][batch2 inference]...
          CPU:                   [batch0 post]     [batch1 post]     ...---final sync

In the overlap timeline the idle gaps shrink for both CPU and GPU; that's the real time we reclaim.

Hands-on Steps

1. Environment Setup

python3 -m venv venv
source venv/bin/activate
pip install torch==2.3.0 torchvision==0.18.0

Run this on a Linux host with CUDA 11.8+ drivers and make sure nvidia-smi sees the GPU.
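
Optionally, a quick Python-side sanity check (a sketch; the exact strings printed will differ on your machine):

import torch

# Confirm the CUDA build of PyTorch sees the driver and the card
print(torch.__version__, torch.version.cuda)
print(torch.cuda.is_available())        # should be True
print(torch.cuda.get_device_name(0))    # the GPU the scripts below will use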

2. Build the “Serial” Baseline

Save as sequential.py:

import math
import statistics
import time
import torch
import torchvision.models as models

def heavy_cpu_work():
    acc = 0.0
    for i in range(500_000):
        acc += math.sqrt(i) * math.sin(i) * math.cos(i)
    return acc

@torch.inference_mode()
def main():
    device = "cuda"
    model = models.vit_b_16(weights=None).to(device).eval()
    dummy = torch.randn(10, 3, 224, 224, device=device)

    _ = model(dummy)  # Warm up
    torch.cuda.synchronize()  # Drain the warm-up before timing starts

    durations = []
    for _ in range(50):
        start = time.perf_counter()
        output = model(dummy)
        torch.cuda.synchronize()
        heavy_cpu_work()
        durations.append(time.perf_counter() - start)

    print(f"Sequential avg latency: {statistics.mean(durations):.4f} s")

if __name__ == "__main__":
    main()

Run python sequential.py and you’ll see average latency around 0.26 s.

Code Walkthrough:

  • @torch.inference_mode() trims autograd overhead during inference so you don’t build computation graphs.
  • dummy = torch.randn(...) creates a fixed-shape sample so you can test without real inputs.
  • torch.cuda.synchronize() is the key brake: it waits for the GPU to finish before the CPU runs heavy_cpu_work(), making the flow entirely serial.
  • heavy_cpu_work() is a CPU-bound stand-in for real post-processing. Python won't optimize the loop away, so there's no need for a black_box helper; Rust folks will recognize the pattern, though.

3. Rewrite as the “Overlap” Version

Save as overlap.py:

import collections
import math
import statistics
import time
import torch
import torchvision.models as models

def heavy_cpu_work(batch_id):
    acc = 0.0
    for i in range(500_000):
        value = batch_id * 31 + i
        acc += math.sqrt(value) * math.sin(value) * math.cos(value)
    return acc

@torch.inference_mode()
def main():
    device = "cuda"
    model = models.vit_b_16(weights=None).to(device).eval()
    dummy = torch.randn(10, 3, 224, 224, device=device)

    _ = model(dummy)
    torch.cuda.synchronize()

    backlog = collections.deque()
    durations = []

    for step in range(50):
        start = time.perf_counter()
        backlog.append((step, model(dummy)))

        if len(backlog) >= 2:
            old_step, _ = backlog.popleft()
            heavy_cpu_work(old_step)

        torch.cuda.synchronize()
        durations.append(time.perf_counter() - start)

    while backlog:
        last_step, _ = backlog.popleft()
        heavy_cpu_work(last_step)

    print(f"Overlap avg latency: {statistics.mean(durations):.4f} s")

if __name__ == "__main__":
    main()

Run python overlap.py and the average latency typically drops to about 0.17 s—roughly one-third faster for the same model.

Code Walkthrough:

  • backlog = collections.deque() is your serving tray. It stores tensors fresh from the GPU that the CPU hasn’t processed yet.
  • Inside the loop, the GPU grabs the next batch first: backlog.append((step, model(dummy))). Once the queue has at least two items, pop the oldest one for the CPU to chew on.
  • torch.cuda.synchronize() is still there, but it now comes after the CPU has chewed through the previous batch, so that post-processing runs while the GPU is still computing the current one.
  • The final while backlog is there to flush the queue; otherwise the last dish stays on the counter.

4. Quick Comparison

python sequential.py
python overlap.py

Put the two results side by side to quantify what the pipelined kitchen gains you.
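
To interpret the gap, keep the ceiling in mind: with perfect overlap the per-step latency approaches max(GPU time, CPU time) rather than their sum. Here's a rough sketch that times the two legs in isolation (same model and fake workload as above; the script name and exact numbers are just examples). If sequential.py lands near the sum of the two legs and overlap.py near the larger one, the pipeline is doing its job:

import math
import statistics
import time
import torch
import torchvision.models as models

def heavy_cpu_work():
    acc = 0.0
    for i in range(500_000):
        acc += math.sqrt(i) * math.sin(i) * math.cos(i)
    return acc

@torch.inference_mode()
def main():
    model = models.vit_b_16(weights=None).to("cuda").eval()
    dummy = torch.randn(10, 3, 224, 224, device="cuda")

    _ = model(dummy)
    torch.cuda.synchronize()

    gpu_times, cpu_times = [], []
    for _ in range(20):
        start = time.perf_counter()
        model(dummy)
        torch.cuda.synchronize()          # GPU leg measured on its own
        gpu_times.append(time.perf_counter() - start)

        start = time.perf_counter()
        heavy_cpu_work()                  # CPU leg measured on its own
        cpu_times.append(time.perf_counter() - start)

    print(f"GPU leg: {statistics.mean(gpu_times):.4f} s, "
          f"CPU leg: {statistics.mean(cpu_times):.4f} s")

if __name__ == "__main__":
    main()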

Common Pitfalls and Fixes

  • CPU still reading the current batch output? That instantly forces a sync. Fix: only touch tensors saved from earlier batches (see the sketch after this list).
  • Forgot the final torch.cuda.synchronize()? You miss the tail latency. Fix: force one last sync when wrapping up.
  • CPU post-processing blocked on disk I/O? The pipeline stalls. Fix: move disk reads into another thread or use data loader prefetching.
  • VRAM tight? Bigger backlogs eat memory. Fix: keep the deque length at 2–3 and delete old tensors promptly.
  • Unsure about utilization? Watch nvidia-smi alongside htop. When GPU utilization stays high and CPU cores are busy at the same time, the overlap is working.
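
To make the first pitfall concrete, here's a sketch of the anti-pattern and the fix (same model and batch shape as overlap.py; the argmax post-processing is just an example stand-in):

import collections
import torch
import torchvision.models as models

model = models.vit_b_16(weights=None).to("cuda").eval()
dummy = torch.randn(10, 3, 224, 224, device="cuda")
backlog = collections.deque()

with torch.inference_mode():
    for step in range(10):
        out = model(dummy)                 # Enqueued; the GPU is still working on it

        # Anti-pattern: reading the batch we just launched forces an implicit
        # sync right here and collapses the overlap:
        # top1 = out.argmax(dim=1).cpu()

        # Pattern from overlap.py: park the fresh output and only read the one
        # from the previous step, so the GPU always has the next batch queued
        # while the CPU is busy reading.
        backlog.append(out)
        if len(backlog) >= 2:
            prev = backlog.popleft()
            top1 = prev.argmax(dim=1).cpu()

    # Flush whatever is still parked on the counter
    while backlog:
        prev = backlog.popleft()
        top1 = prev.argmax(dim=1).cpu()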

Wrap-up and Next Steps

  • GPU inference is asynchronous by default—don’t leave the CPU waiting at the door.
  • Once CPU/GPU workloads are decoupled, average latency drops immediately.
  • PyTorch makes overlap approachable: a few lines create a working pipeline.

Next experiments:

  1. Swap heavy_cpu_work for your real post-processing or logging and measure the gains.
  2. Use torch.cuda.Stream to split preprocessing, inference, and post-processing into separate pipelines.
  3. Mark key regions with nvtx and inspect the timeline in nsys to confirm the overlap is real (a rough sketch follows).
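
For experiment 3, here's a rough sketch of what the NVTX marking can look like (range labels are just examples). Run it under nsys profile python your_script.py and check in the timeline that the CUDA kernel rows keep running while the cpu_postprocess ranges are open:

import math
import torch
import torchvision.models as models

def heavy_cpu_work():
    return sum(math.sqrt(i) * math.sin(i) for i in range(500_000))

model = models.vit_b_16(weights=None).to("cuda").eval()
dummy = torch.randn(10, 3, 224, 224, device="cuda")

with torch.inference_mode():
    _ = model(dummy)
    torch.cuda.synchronize()

    have_previous = False
    for step in range(10):
        torch.cuda.nvtx.range_push("gpu_launch")
        out = model(dummy)                 # Async launch only
        torch.cuda.nvtx.range_pop()

        torch.cuda.nvtx.range_push("cpu_postprocess")
        if have_previous:
            heavy_cpu_work()               # Runs while the kernels launched above still execute
        have_previous = True
        torch.cuda.nvtx.range_pop()

    torch.cuda.synchronize()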

Found this useful? Drop a comment about your own inference speed-up tips, or share it with teammates tuning their pipelines. Every like nudges us to write another CPU/GPU collaboration deep dive.