Skip to main content
Version: Next 🚧

vLLM Semantic Router

System-Level Intelligence for Mixture-of-Models (MoM) - An intelligent routing layer that brings collective intelligence to LLM systems. Acting as an Envoy External Processor (ExtProc), it uses a signal-driven decision engine and plugin chain architecture to capture missing signals, make better routing decisions, and secure your LLM infrastructure.

Project Goals​

We are building the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems, answering:

  1. How to capture the missing signals in request, response and context?
  2. How to combine the signals to make better decisions?
  3. How to collaborate more efficiently between different models?
  4. How to secure the real world and LLM system from jailbreaks, PII leaks, hallucinations?
  5. How to collect valuable signals and build a self-learning system?

Core Architecture​

Signal-Driven Decision Engine​

Captures and combines 6 types of signals to make intelligent routing decisions:

Signal TypeDescriptionUse Case
keywordPattern matching with AND/OR operatorsFast rule-based routing for specific terms
embeddingSemantic similarity using embeddingsIntent detection and semantic understanding
domainMMLU domain classification (14 categories)Academic and professional domain routing
fact_checkML-based fact-checking requirement detectionIdentify queries needing fact verification
user_feedbackUser satisfaction and feedback classificationHandle follow-up messages and corrections
preferenceLLM-based route preference matchingComplex intent analysis via external LLM

How it works: Signals are extracted from requests, combined using AND/OR operators in decision rules, and used to select the best model and configuration.

Plugin Chain Architecture​

Extensible plugin system for request/response processing:

Plugin TypeDescriptionUse Case
semantic-cacheSemantic similarity-based cachingReduce latency and costs for similar queries
jailbreakAdversarial prompt detectionBlock prompt injection and jailbreak attempts
piiPersonally identifiable information detectionProtect sensitive data and ensure compliance
system_promptDynamic system prompt injectionAdd context-aware instructions per route
header_mutationHTTP header manipulationControl routing and backend behavior
hallucinationToken-level hallucination detectionReal-time fact verification during generation

How it works: Plugins form a processing chain, each plugin can inspect/modify requests and responses, with configurable enable/disable per decision.

Architecture Overview​

Key Benefits​

Intelligent Routing​

  • Signal Fusion: Combine multiple signals (keyword + embedding + domain) for accurate routing
  • Adaptive Decisions: Use AND/OR operators to create complex routing logic
  • Model Specialization: Route math to math models, code to code models, etc.

Security & Compliance​

  • Multi-layer Protection: PII detection, jailbreak prevention, hallucination detection
  • Policy Enforcement: Model-specific PII policies and security rules
  • Audit Trail: Complete logging of all security decisions

Performance & Cost​

  • Semantic Caching: 10-100x latency reduction for similar queries
  • Smart Model Selection: Use smaller models for simple tasks, larger for complex
  • Tool Optimization: Auto-select relevant tools to reduce token usage

Flexibility & Extensibility​

  • Plugin Architecture: Add custom processing logic without modifying core
  • Signal Extensibility: Define new signal types for your use cases
  • Configuration-Driven: Change routing behavior without code changes

Use Cases​

  • Enterprise API Gateways: Intelligent routing with security and compliance
  • Multi-tenant Platforms: Per-tenant routing policies and model selection
  • Development Environments: Cost optimization through smart model selection
  • Production Services: High-performance routing with comprehensive monitoring
  • Regulated Industries: Compliance-ready with PII detection and audit trails

Documentation Structure​

This documentation is organized into the following sections:

Overview​

Learn about our goals, semantic routing concepts, collective intelligence, and signal-driven decisions.

Installation & Configuration​

Get started with installation and learn how to configure signals, decisions, and plugins.

Tutorials​

Step-by-step guides for implementing intelligent routing, semantic caching, content safety, and observability.

Contributing​

We welcome contributions! Please see our Contributing Guide for details.

License​

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.