The Hidden Power of Tensor Offloading: Boosting Local LLM Performance
Running large language models locally has been a fascinating journey, especially for those of us who’ve been tinkering with these systems on consumer-grade hardware. Recently, I’ve discovered something quite remarkable about tensor offloading that’s completely changed how I approach running these models on my setup.
The traditional approach of offloading entire layers to manage VRAM constraints turns out to be rather inefficient. Instead, selectively offloading specific tensors, particularly the large feed-forward network (FFN) weight tensors, to the CPU while keeping the attention tensors on the GPU can dramatically improve performance. We're talking about potential speed improvements of 200% or more in some cases.
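For concreteness, here is a minimal sketch of how this kind of selective placement can be wired up with llama.cpp, assuming a build recent enough to support the `--override-tensor` (`-ot`) flag. The model path and the tensor-name regex are purely illustrative and would need adjusting to the tensor names of your particular GGUF file.

```python
#!/usr/bin/env python3
"""Sketch: launch llama.cpp with selective tensor offloading.

Assumes a llama.cpp build that supports the --override-tensor (-ot)
flag. The model path and regex below are illustrative examples, not
taken from the original post.
"""
import subprocess

MODEL = "models/example-model.gguf"  # hypothetical path

cmd = [
    "./llama-server",
    "-m", MODEL,
    # Claim every layer for the GPU as far as layer offloading is concerned...
    "-ngl", "99",
    # ...but override the placement of the large FFN weight tensors so they
    # stay in system RAM, leaving VRAM for attention tensors and the KV cache.
    # The pattern is a regex matched against tensor names; adjust it to the
    # naming scheme of your model (e.g. MoE models use different FFN names).
    "-ot", r"ffn_(up|down|gate)\.weight=CPU",
    "-c", "8192",
]

print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)
```

The idea is that layer offloading (`-ngl`) still routes every layer to the GPU by default, while the override sends only the matching FFN weight tensors back to system RAM, so the attention computation keeps the fast path.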