Inference Optimization
3 posts

Cramming a 400B Model into 48GB: The Magic Behind LLM in a Flash
An Apple paper from 2023 made it possible to run a 400-billion-parameter model on an ordinary MacBook. Behind the core technologies, MoE and quantization, lies an engineering philosophy built around on-demand loading.
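
As a rough illustration of that on-demand philosophy (a minimal sketch, not the paper's actual method; `ExpertStore`, the file layout, and the shapes are all invented), the idea is to keep MoE expert weights memory-mapped on SSD and pull in only the experts the router selects:

```python
# Minimal sketch of on-demand expert loading (illustrative only): all
# expert weights live in one memory-mapped file on SSD, and a small LRU
# cache keeps only the recently routed experts resident in RAM.
from collections import OrderedDict

import numpy as np

class ExpertStore:
    def __init__(self, path, n_experts, d_in, d_out, cache_size=4):
        # mode="r" assumes the weight file already exists on disk;
        # nothing is paged into RAM until an expert is actually read.
        self.weights = np.memmap(path, dtype=np.float16, mode="r",
                                 shape=(n_experts, d_in, d_out))
        self.cache = OrderedDict()      # expert_id -> in-RAM weight matrix
        self.cache_size = cache_size

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)    # mark as recently used
        else:
            if len(self.cache) >= self.cache_size:
                self.cache.popitem(last=False)   # evict least recently used
            # np.asarray forces the actual SSD -> RAM copy for this expert
            self.cache[expert_id] = np.asarray(self.weights[expert_id])
        return self.cache[expert_id]

def moe_forward(x, router_logits, store, top_k=2):
    # Route the activation through only its top-k experts, so at most
    # top_k expert matrices ever need to be resident at once.
    top = np.argsort(router_logits)[-top_k:]
    scores = np.exp(router_logits[top])
    gates = scores / scores.sum()
    return sum(g * (x @ store.get(int(e))) for g, e in zip(gates, top))
```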

90 Seconds of Waiting, Gone: How oMLX Buries Ollama on Mac
Built for Apple Silicon on the MLX framework, oMLX pairs an SSD-backed KV cache with continuous batching to cut TTFT from 90 seconds to 1-3 seconds in long-context scenarios, comprehensively outperforming Ollama.
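
To make the SSD-backed KV cache idea concrete, here is a minimal hypothetical sketch, not oMLX's actual API: prefill KV tensors are written to disk under a hash of the token prefix, so a repeated long context restores them instead of recomputing them.

```python
# Hypothetical sketch of an SSD-backed KV cache: key/value tensors from
# prefill are persisted to disk under a hash of the token prefix, so a
# repeated long context restores its KV state instead of recomputing it.
import hashlib
import os

import numpy as np

CACHE_DIR = "kv_cache"  # assumed location on a local SSD

def _prefix_key(tokens):
    return hashlib.sha256(np.asarray(tokens, np.int32).tobytes()).hexdigest()

def save_kv(tokens, keys, values):
    # Persist the prefill result; np.savez appends the .npz extension.
    os.makedirs(CACHE_DIR, exist_ok=True)
    np.savez(os.path.join(CACHE_DIR, _prefix_key(tokens)), k=keys, v=values)

def load_kv(tokens):
    # A cache hit skips prefill entirely, which is where the TTFT win comes from.
    path = os.path.join(CACHE_DIR, _prefix_key(tokens) + ".npz")
    if not os.path.exists(path):
        return None                     # miss: caller must run prefill
    data = np.load(path)                # hit: read KV tensors back from SSD
    return data["k"], data["v"]
```
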
Tokencake: Multi-Agent KV Cache Scheduling That Cuts vLLM Latency by Half
Researchers from Beihang, Peking University, and Alibaba introduce Tokencake, a KV-cache-centric serving framework for multi-agent applications. By coordinating scheduling in both time and space with CPU buffering and progressive GPU reservation, it trims end-to-end latency by more than 47% versus vLLM and lifts GPU cache utilization by roughly 17%.
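
A toy sketch of the scheduling idea, with invented names rather than Tokencake's interfaces: a stalled agent's KV blocks are parked in a CPU buffer so active agents can use the GPU cache, then re-uploaded just before the agent's predicted resume.

```python
# Rough sketch of the space-scheduling idea (invented names, not Tokencake's
# API): when an agent stalls on a function call, its KV blocks move to a CPU
# buffer so active agents can use the GPU cache; they are uploaded back
# shortly before the agent's predicted resume time.
class KVSpaceScheduler:
    def __init__(self, total_gpu_blocks):
        self.gpu_free = total_gpu_blocks
        self.on_gpu = {}   # agent_id -> KV blocks held in GPU cache
        self.on_cpu = {}   # agent_id -> KV blocks parked in CPU buffer

    def admit(self, agent_id, blocks):
        # A decoding agent claims GPU cache blocks for its KV state.
        if blocks > self.gpu_free:
            raise MemoryError("GPU KV cache exhausted")
        self.gpu_free -= blocks
        self.on_gpu[agent_id] = blocks

    def offload(self, agent_id):
        # Agent entered a function-call stall: park its KV on the CPU,
        # returning its blocks to the free pool instead of letting them idle.
        blocks = self.on_gpu.pop(agent_id)
        self.on_cpu[agent_id] = blocks
        self.gpu_free += blocks

    def upload(self, agent_id):
        # Called shortly before the predicted resume ("progressive GPU
        # reservation"), so decoding restarts without a cold-start stall.
        blocks = self.on_cpu.pop(agent_id)
        self.admit(agent_id, blocks)
```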
