AI & Machine Learning

vLLM and Parallelized Inference: Scaling LLM Serving to Production

January 10, 2025
18 min read
By ML Engineering Team

Deep dive into vLLM architecture, continuous batching, PagedAttention, tensor parallelism, and advanced techniques for serving large language models at scale with optimal throughput and latency.

Introduction: The LLM Serving Challenge

Serving large language models (LLMs) in production presents unique challenges:

  1. Memory bottlenecks: Large models (7B-70B+ parameters) require substantial GPU memory
  2. KV cache management: Attention mechanisms require storing key-value caches that grow with sequence length
  3. Throughput optimization: Balancing latency and throughput for cost-effective serving
  4. Dynamic batching: Handling variable-length requests efficiently

vLLM (Virtual Large Language Model) addresses these challenges through innovative techniques like PagedAttention, continuous batching, and efficient memory management. This article explores vLLM's architecture and parallelization strategies for production-scale LLM serving.

The Memory Challenge: Understanding KV Caches

KV Cache Basics:

  • Key cache shape: [batch_size, num_heads, seq_len, head_dim]
  • Value cache shape: [batch_size, num_heads, seq_len, head_dim]

Example configuration (70B-class model):

  • num_layers = 80
  • num_heads = 64
  • head_dim = 128
  • dtype = float16 (2 bytes)

Memory per token = 2 (keys + values) × num_layers × num_heads × head_dim × bytes per value
                 = 2 × 80 × 64 × 128 × 2 bytes ≈ 2.6 MB

For a 2048-token sequence: 2.6 MB × 2048 ≈ 5.3 GB per request!

Traditional serving wastes 60-80% of memory due to fragmentation.
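
To make this arithmetic concrete, here is a small sketch that reproduces the numbers above; the dimensions are the example values listed earlier, not tied to any specific checkpoint.

Python
# Back-of-the-envelope KV cache sizing, using the example configuration above

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: keys + values, across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=64, head_dim=128)
print(f"Per token: {per_token / 1e6:.2f} MB")                    # ~2.62 MB
print(f"2048-token sequence: {per_token * 2048 / 1e9:.2f} GB")   # ~5.37 GB

Note that models using grouped-query attention cache fewer KV heads than attention heads, which shrinks these numbers proportionally.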

PagedAttention: Virtual Memory for LLMs

How PagedAttention Works:

  1. Divide the KV cache into fixed-size blocks (e.g., 16 tokens each)
  2. Store each block in non-contiguous GPU memory
  3. Maintain a block table mapping logical blocks to physical blocks (see the sketch below)
  4. Compute attention using block-level addressing

Benefits:

  • 2-4x better memory utilization
  • Support for longer sequences
  • Higher batch sizes
  • Prefix caching for common prompts
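
The block table is the core piece of bookkeeping. Below is a minimal, framework-free sketch of the idea (not vLLM's internal implementation): logical blocks of a sequence map to arbitrary physical slots, and finished sequences return their blocks to a shared free pool.

Python
# Simplified block table illustrating PagedAttention's bookkeeping (not vLLM internals)

BLOCK_SIZE = 16  # tokens per block

class BlockTable:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.tables = {}  # request_id -> list of physical block indices

    def append_token(self, request_id: str, num_tokens_so_far: int):
        """Allocate a new physical block whenever a sequence crosses a block boundary."""
        table = self.tables.setdefault(request_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:     # all current blocks are full
            table.append(self.free_blocks.pop())    # any free block works (non-contiguous)

    def physical_location(self, request_id: str, token_index: int):
        """Translate a logical token position to (physical block, offset within block)."""
        block = self.tables[request_id][token_index // BLOCK_SIZE]
        return block, token_index % BLOCK_SIZE

    def free(self, request_id: str):
        """Return a finished sequence's blocks to the shared free pool."""
        self.free_blocks.extend(self.tables.pop(request_id, []))

Because the attention kernel reaches the KV cache through this indirection, blocks never need to be contiguous, and freed blocks are reusable immediately, which is where the 2-4x utilization gain comes from.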

vLLM Architecture

Python
# vLLM Server Architecture

import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


def generate_request_id() -> str:
    """Unique ID so the engine can track each request through the batch."""
    return str(uuid.uuid4())


class vLLMServer:
    def __init__(self, model_name: str, tensor_parallel_size: int = 1):
        engine_args = AsyncEngineArgs(
            model=model_name,
            tensor_parallel_size=tensor_parallel_size,
            # PagedAttention configuration
            block_size=16,  # Tokens per block
            max_num_batched_tokens=8192,
            max_num_seqs=256,  # Max concurrent requests
            # Memory management
            gpu_memory_utilization=0.95,  # Use 95% of GPU memory
            swap_space=4,  # GB of CPU swap space
            # Performance
            enable_prefix_caching=True,
            disable_log_stats=False
        )
        
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)
        
    async def generate(
        self,
        prompt: str,
        max_tokens: int = 512,
        temperature: float = 0.7
    ):
        sampling_params = SamplingParams(
            temperature=temperature,
            max_tokens=max_tokens,
            top_p=0.95
        )
        
        # Continuous batching handles request scheduling
        result_generator = self.engine.generate(
            prompt,
            sampling_params,
            request_id=generate_request_id()
        )
        
        async for output in result_generator:
            yield output

# Key features:
# - PagedAttention for memory efficiency
# - Continuous batching for throughput
# - Prefix caching for common prompts
# - Automatic request scheduling

Continuous Batching

Continuous Batching Benefits:

  1. Higher GPU Utilization: No idle time waiting for batch completion
  2. Better Throughput: 2-3x improvement over static batching
  3. Lower Latency: Short requests don't wait for long ones
  4. Automatic Scheduling: The engine handles the complexity

How it works (see the sketch below):

  • Track per-request generation state
  • Add new requests to the batch as soon as slots free up
  • Remove completed requests immediately
  • Compute attention only for the active sequences at each step
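
The following toy loop illustrates the scheduling idea, independent of vLLM's actual scheduler: after every decode step, finished sequences leave the batch and queued requests take their slots immediately.

Python
# Toy continuous-batching loop (conceptual; vLLM's real scheduler is far more involved)
from collections import deque

def continuous_batching_loop(model_step, waiting: deque, max_batch_size: int = 256):
    """model_step(active) runs one decode step and returns the request ids that finished."""
    active = {}  # request_id -> in-flight request state

    while waiting or active:
        # Admit new requests whenever slots are free, without waiting for the batch to drain
        while waiting and len(active) < max_batch_size:
            request = waiting.popleft()
            active[request.request_id] = request

        # One decode step over the currently active sequences only
        finished_ids = model_step(active)

        # Evict finished sequences immediately, freeing their KV cache blocks
        for request_id in finished_ids:
            active.pop(request_id)

With static batching, admission only happens between batches, so every slot stays occupied until the longest sequence finishes; continuous batching removes that idle time.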

Tensor Parallelism

Tensor Parallelism Strategies:

  1. Attention Head Parallelism: Split attention heads across GPUs
  2. Layer-wise Parallelism: Place different layers on different GPUs
  3. Pipeline Parallelism: Micro-batching across a GPU pipeline

When to use tensor parallelism (configuration example below):

  • The model is too large for a single GPU (>40GB of weights)
  • You need lower per-request latency
  • You have a high-bandwidth GPU interconnect (NVLink, InfiniBand)
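
In vLLM, tensor parallelism is a single argument; the sketch below assumes a node with four visible GPUs and uses the offline LLM entry point. Under the hood, weight matrices are sharded across the GPUs and partial results are combined with all-reduce operations, which is why interconnect bandwidth matters.

Python
# Offline inference with tensor parallelism across 4 GPUs (assumes 4 visible GPUs)
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=4,        # shard attention heads / MLP weights over 4 GPUs
    dtype="float16",
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Explain PagedAttention in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)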

Production vLLM Deployment

Python
# production_vllm_server.py
import asyncio
import logging
import uuid

import torch
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams
from prometheus_client import Counter, Histogram, Gauge

# Metrics
REQUEST_COUNT = Counter('vllm_requests_total', 'Total requests')
REQUEST_DURATION = Histogram('vllm_request_duration_seconds', 'Request duration')
GPU_MEMORY = Gauge('vllm_gpu_memory_used_bytes', 'GPU memory used')
ACTIVE_REQUESTS = Gauge('vllm_active_requests', 'Active requests')

class ProductionvLLMServer:
    def __init__(self):
        self.engine_args = AsyncEngineArgs(
            model="meta-llama/Llama-2-70b-chat-hf",
            tensor_parallel_size=4,  # 4x A100 80GB
            dtype="float16",
            # PagedAttention config
            block_size=16,
            max_num_batched_tokens=16384,
            max_num_seqs=512,
            gpu_memory_utilization=0.95,
            # Performance optimizations
            enable_prefix_caching=True,
            enable_chunked_prefill=True,
            # Reliability
            disable_log_stats=False,
            max_log_len=100,
        )
        
        self.engine = AsyncLLMEngine.from_engine_args(self.engine_args)
        self.logger = logging.getLogger(__name__)
        
        # Start background monitoring (requires a running event loop, e.g. a FastAPI startup hook)
        asyncio.create_task(self._monitor_resources())

    def _generate_id(self) -> str:
        """Unique request ID for the engine's scheduler."""
        return str(uuid.uuid4())

    async def generate(
        self,
        prompt: str,
        max_tokens: int = 512,
        temperature: float = 0.7,
        request_id: str = None
    ):
        REQUEST_COUNT.inc()
        ACTIVE_REQUESTS.inc()
        
        try:
            sampling_params = SamplingParams(
                temperature=temperature,
                max_tokens=max_tokens,
                top_p=0.95,
                frequency_penalty=0.1,
                presence_penalty=0.1,
                stop=["</s>", "User:", "Assistant:"]
            )
            
            with REQUEST_DURATION.time():
                result_generator = self.engine.generate(
                    prompt,
                    sampling_params,
                    request_id=request_id or self._generate_id()
                )
                
                async for output in result_generator:
                    # Stream tokens back to client
                    yield {
                        'text': output.outputs[0].text,
                        'finished': output.finished,
                        'tokens': len(output.outputs[0].token_ids)
                    }
                    
        except Exception as e:
            self.logger.error(f"Generation error: {e}", exc_info=True)
            raise
        finally:
            ACTIVE_REQUESTS.dec()
    
    async def _monitor_resources(self):
        """Periodically export GPU memory usage (per-process, via torch.cuda)."""
        while True:
            try:
                # Memory allocated by this process on GPU 0; a reasonable proxy
                # for the engine's weight + KV cache footprint
                gpu_memory_used = torch.cuda.memory_allocated(0)
                GPU_MEMORY.set(gpu_memory_used)

                self.logger.info(
                    f"vLLM Stats - GPU Memory: {round(gpu_memory_used / 1e9, 2)}GB"
                )
            except Exception as e:
                self.logger.error(f"Monitoring error: {e}")

            await asyncio.sleep(10)
    
    async def health_check(self) -> dict:
        """Health check endpoint"""
        try:
            # A successful round-trip to the engine means its loop is alive
            model_config = await self.engine.get_model_config()
            return {
                'status': 'healthy',
                'model': self.engine_args.model,
                'max_model_len': model_config.max_model_len,
                'gpu_memory_used_bytes': torch.cuda.memory_allocated(0)
            }
        except Exception as e:
            return {'status': 'unhealthy', 'error': str(e)}

# FastAPI Integration
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
server = None  # created on startup, inside the running event loop


@app.on_event("startup")
async def startup():
    # Constructing here lets asyncio.create_task() in __init__ attach
    # the monitoring task to the running event loop
    global server
    server = ProductionvLLMServer()

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 512
    temperature: float = 0.7

@app.post("/v1/completions")
async def generate(request: GenerateRequest):
    try:
        tokens = []
        async for output in server.generate(
            request.prompt,
            request.max_tokens,
            request.temperature
        ):
            tokens.append(output)
        return {"completion": tokens[-1]['text']}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return await server.health_check()
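
A quick way to exercise the endpoint is a plain HTTP client; this sketch assumes the server above is running locally on port 8000, and the payload mirrors the GenerateRequest model rather than any official vLLM client.

Python
# client.py - call the FastAPI endpoint defined above (assumes localhost:8000)
import httpx

response = httpx.post(
    "http://localhost:8000/v1/completions",
    json={
        "prompt": "Explain continuous batching in two sentences.",
        "max_tokens": 128,
        "temperature": 0.7,
    },
    timeout=120.0,
)
response.raise_for_status()
print(response.json()["completion"])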

Performance Optimization Techniques

  1. Prefix Caching

  • Cache common prompt prefixes (e.g., system prompts)
  • Share KV cache blocks across requests
  • Reduces computation for repeated prefixes
  • 2-5x speedup for chat applications

  2. Chunked Prefill

  • Split long prompts into chunks
  • Overlap prefill with generation
  • Reduces time-to-first-token (TTFT)

  3. Quantization

  • INT8/INT4 quantization for weights
  • Reduces memory by 2-4x
  • Minimal quality loss with proper calibration
  • AWQ, GPTQ, SmoothQuant methods

  4. FlashAttention

  • Fused attention kernel
  • 2-3x faster attention computation
  • Lower memory footprint
  • Especially important for long contexts (>4K tokens)

  5. Speculative Decoding

  • Use a small draft model to predict tokens
  • Verify with the large model in parallel
  • 2-3x speedup for compatible prompts

A configuration sketch showing how several of these are enabled in vLLM follows.
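
As a rough sketch, several of these optimizations map directly to vLLM engine arguments; the checkpoint name below is illustrative and would need to be an actual AWQ-quantized model.

Python
# Enabling AWQ quantization, prefix caching, and chunked prefill in vLLM
# (model name is illustrative; use a real AWQ-quantized checkpoint)
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-70B-chat-AWQ",  # pre-quantized INT4 weights
    quantization="awq",                     # ~4x smaller weights than FP16
    enable_prefix_caching=True,             # reuse KV blocks for shared prompt prefixes
    enable_chunked_prefill=True,            # interleave prefill chunks with decode steps
    tensor_parallel_size=2,
)

system_prompt = "You are a helpful assistant.\n\n"
prompts = [system_prompt + q for q in ["What is PagedAttention?", "What is chunked prefill?"]]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))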

Scaling Strategies

Scaling Patterns:

  1. Horizontal Scaling: Multiple vLLM instances behind a load balancer
  2. Model Parallelism: Tensor/pipeline parallelism for large models
  3. Mixed Models: Different model sizes for different use cases (see the routing sketch below)
  4. Request Routing: Route by complexity and latency requirements

Cost estimate (self-hosted Llama-2-70B):

  • 4x A100 80GB: ~$10-15/hour on AWS/GCP
  • Throughput: ~100-200 concurrent requests
  • Cost per 1M tokens: $0.50-1.00 (vs $20-60 for hosted APIs)
  • Break-even: roughly 20M+ tokens/month
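
A minimal sketch of patterns 3 and 4 above, assuming two separate vLLM deployments behind hypothetical internal URLs (a small model for short, simple prompts and the 70B model for the rest); the length threshold stands in for whatever complexity signal you actually use.

Python
# Hypothetical request router: small model for short prompts, 70B for everything else.
# Endpoint URLs and the threshold are placeholders, not real services.
import httpx

SMALL_MODEL_URL = "http://vllm-small-model:8000/v1/completions"   # e.g. a 7B deployment
LARGE_MODEL_URL = "http://vllm-large-model:8000/v1/completions"   # e.g. the 70B deployment

def pick_backend(prompt: str, max_tokens: int) -> str:
    # Stand-in heuristic; a latency SLO or a lightweight classifier could go here instead
    if len(prompt) < 500 and max_tokens <= 128:
        return SMALL_MODEL_URL
    return LARGE_MODEL_URL

async def route_completion(prompt: str, max_tokens: int = 256) -> str:
    backend = pick_backend(prompt, max_tokens)
    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(backend, json={"prompt": prompt, "max_tokens": max_tokens})
        response.raise_for_status()
        return response.json()["completion"]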

Monitoring and Observability

PROMETHEUS
# Prometheus metrics for vLLM

# Request metrics
vllm_requests_total{model="llama-2-70b",status="success"} 15234
vllm_requests_total{model="llama-2-70b",status="error"} 12
vllm_request_duration_seconds_bucket{le="1.0"} 8500
vllm_request_duration_seconds_bucket{le="5.0"} 14800

# Resource metrics
vllm_gpu_memory_used_bytes{gpu="0"} 75000000000
vllm_gpu_memory_used_bytes{gpu="1"} 74500000000
vllm_active_requests 42
vllm_queue_size 8

# KV Cache metrics
vllm_kv_cache_usage_ratio 0.87
vllm_kv_cache_blocks_used 4250
vllm_kv_cache_blocks_total 5000

# Throughput metrics
vllm_tokens_generated_total 523847
vllm_tokens_per_second 1247.3

# Grafana Dashboard Queries:
# - Request latency p50, p95, p99
# - Throughput (tokens/sec, requests/sec)
# - GPU utilization and memory
# - Queue depth and wait times
# - Error rates by type

Key Takeaways

  1. PagedAttention enables 2-4x better memory utilization through virtual-memory-style block management
  2. Continuous Batching provides 2-3x higher throughput than static batching
  3. Tensor Parallelism allows serving models too large for a single GPU
  4. vLLM combines these techniques for production-ready LLM serving
  5. Cost Efficiency: Self-hosted vLLM can be 20-40x cheaper than API calls at scale, with break-even around 20M+ tokens/month
  6. When to Use vLLM: Sustained, high-volume workloads where throughput and cost per token outweigh the operational overhead of self-hosting

vLLM has become the standard for high-performance LLM serving, offering the best combination of throughput, latency, and cost efficiency for production deployments.

Tags: vLLM, LLM, Inference, Parallelization, Performance, Production

Need Expert Help?

Our team has extensive experience implementing solutions like this. Let's discuss your project.