<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>LLM Inference Frameworks - Category - Naifan Li's Blog</title><link>https://blog.omagiclee.com/categories/llm-inference-frameworks/</link><description>LLM Inference Frameworks - Category - Naifan Li's Blog</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Thu, 27 Nov 2025 21:07:50 +0800</lastBuildDate><atom:link href="https://blog.omagiclee.com/categories/llm-inference-frameworks/" rel="self" type="application/rss+xml"/><item><title>vLLM: Easy, fast, and cheap LLM inference and serving</title><link>https://blog.omagiclee.com/posts/toolkits/llm-inference-engines/vllm/</link><pubDate>Thu, 27 Nov 2025 21:07:50 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/toolkits/llm-inference-engines/vllm/</guid><description><![CDATA[<p><a href="https://docs.vllm.ai/en/latest/" target="_blank" rel="noopener noreferrer">Docs</a> · <a href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener noreferrer">GitHub</a></p>
<p>vLLM is a high-throughput and memory-efficient <strong>inference and serving engine for LLMs</strong>.</p>
<ul>
<li>Run open-source models on vLLM (see the sketch after this list)</li>
<li>Build applications with vLLM</li>
<li>Build vLLM</li>
</ul>
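<p>For a first look at the Python API, a minimal offline-inference sketch might look like the following (the model name and sampling settings are illustrative, not prescriptive):</p>
<pre><code class="language-python"># Minimal sketch: offline inference with vLLM's Python API.
# The model name below is only an example of an open-source checkpoint.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

llm = LLM(model="facebook/opt-125m")              # load the model into the engine
outputs = llm.generate(prompts, sampling_params)  # batched generation

for output in outputs:
    print(output.outputs[0].text)                 # first completion per prompt
</code></pre>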
<p>vLLM is fast with:</p>
<ul>
<li>State-of-the-art serving throughput</li>
<li>Efficient management of attention key and value memory with PagedAttention</li>
<li>Continuous batching of incoming requests</li>
<li>Fast model execution with CUDA/HIP graph</li>
<li>Quantization: GPTQ, AWQ, INT4, INT8, and FP8 (see the configuration sketch after this list)</li>
<li>Optimized CUDA kernels, including integration with FlashAttention and FlashInfer</li>
<li>Speculative decoding</li>
<li>Chunked prefill</li>
</ul>
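<p>Several of these optimizations are exposed as engine arguments. A hedged configuration sketch follows; the checkpoint name is a placeholder, and exact argument availability can vary across vLLM versions:</p>
<pre><code class="language-python"># Sketch: enabling AWQ quantization and chunked prefill via engine arguments.
# The model name is a placeholder for an AWQ-quantized checkpoint.
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",              # run AWQ-quantized weights
    enable_chunked_prefill=True,     # split long prefills into smaller chunks
    gpu_memory_utilization=0.9,      # KV-cache budget managed by PagedAttention
)
</code></pre>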
<p>vLLM is flexible and easy to use with:</p>]]></description></item></channel></rss>