Intelligent LoRA Routing
This guide shows you how to combine intelligent routing (domain/embedding/keyword/MCP) with LoRA adapters to route requests to domain-specific models. LoRA routing uses the classification methods from previous guides to detect intent, then automatically selects the appropriate LoRA adapter on the vLLM backend.
Key Advantages
- Intent-aware adapter selection: Combines any classification method (domain/embedding/keyword/MCP) with LoRA adapters
- Memory efficient: Share base model weights across multiple domain adapters (<1% parameters per adapter)
- Transparent to users: Users send requests to one endpoint, router handles adapter selection
- Flexible classification: Choose the best routing method for your use case (domain for accuracy, keyword for compliance, etc.)
What Problem Does It Solve?
vLLM supports multiple LoRA adapters, but users must manually specify which adapter to use. LoRA routing automates this:
- Manual adapter selection: Users don't know which adapter to use → the router classifies intent and selects the adapter automatically
- Memory efficiency: Multiple full models don't fit on the GPU → LoRA adapters share the base weights at ~1% overhead per adapter (see the sketch after this list)
- Deployment simplicity: Managing multiple model endpoints is complex → a single vLLM instance serves all adapters
- Intent detection: A generic base model lacks domain expertise → the router routes to specialized adapters based on query content
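Those memory numbers are easy to sanity-check. A back-of-the-envelope sketch, assuming rank-16 adapters applied to the attention q/v projections (both are assumptions; actual ranks and target modules depend on how each adapter was trained):

```python
# Back-of-the-envelope LoRA overhead for Llama-2-7B.
# Illustrative numbers, not measured from a real checkpoint.
hidden = 4096   # Llama-2-7B hidden size
layers = 32     # transformer layers
rank = 16       # a typical LoRA rank (assumption)
targets = 2     # adapting q_proj and v_proj only (assumption)

base_params = 7e9
# Each adapted projection adds two low-rank matrices:
# A (hidden x rank) and B (rank x hidden).
lora_params = layers * targets * 2 * hidden * rank

print(f"adapter params: {lora_params / 1e6:.1f}M")      # ~8.4M
print(f"overhead vs base: {lora_params / base_params:.2%}")  # ~0.12%
```

Even quadrupling the rank or the number of target modules keeps the per-adapter overhead well under 1% of the base model.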
When to Use
- Multi-domain vLLM deployments with LoRA adapters for different domains (technical, medical, legal, etc.)
- Automatic adapter selection where you want users to send requests without knowing adapter names
- Combining classification + LoRA: Use domain routing for accuracy, keyword routing for compliance, or MCP for custom logic
- Memory-constrained scenarios where multiple full models don't fit but LoRA adapters do
- A/B testing different adapter versions by adjusting category scores (see the sketch after this list)
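For the A/B-testing bullet, the idea is to register two versions of the same adapter and split traffic via their scores. A hypothetical sketch of score-weighted selection follows; whether the router interprets scores as traffic weights or simply picks the highest-scoring entry depends on its implementation, so treat this as an illustration of the idea only:

```python
import random

# Hypothetical category entry with two versions of the same adapter.
# Assumes scores act as relative selection weights; check your router's
# actual selection semantics before relying on this.
model_scores = [
    {"model": "llama2-7b", "lora_name": "technical-lora-v1", "score": 0.9},
    {"model": "llama2-7b", "lora_name": "technical-lora-v2", "score": 0.1},
]

def pick_adapter(entries):
    """Weighted random choice: v1 gets ~90% of traffic, v2 ~10%."""
    weights = [e["score"] for e in entries]
    return random.choices(entries, weights=weights, k=1)[0]["lora_name"]

print(pick_adapter(model_scores))
```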
Configuration
Prerequisites
- A running vLLM server with LoRA support enabled
- LoRA adapter files (fine-tuned for specific domains)
- Envoy + the router (see Installation guide)
1. Start vLLM with LoRA Adapters
First, start your vLLM server with LoRA support enabled:
```bash
vllm serve meta-llama/Llama-2-7b-hf \
  --enable-lora \
  --lora-modules \
  technical-lora=/path/to/technical-adapter \
  medical-lora=/path/to/medical-adapter \
  legal-lora=/path/to/legal-adapter \
  --host 0.0.0.0 \
  --port 8000
```
Key flags:
- `--enable-lora`: enables LoRA adapter support
- `--lora-modules`: registers LoRA adapters by name and path, in the format `adapter-name=/path/to/adapter`
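To confirm the adapters registered, list the models the server exposes; vLLM's OpenAI-compatible API reports LoRA modules alongside the base model. A quick check, assuming the server from the command above is listening on localhost:8000:

```python
import json
from urllib.request import urlopen

# Each --lora-modules entry is registered as a servable model, so the
# adapters should appear alongside the base model in /v1/models.
with urlopen("http://localhost:8000/v1/models") as resp:
    models = json.load(resp)["data"]

for m in models:
    print(m["id"])
# Expected output (roughly):
#   meta-llama/Llama-2-7b-hf
#   technical-lora
#   medical-lora
#   legal-lora
```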
2. Router Configuration
Put this in config/config.yaml (or merge into your existing config):
```yaml
# Category classifier (required for intent detection)
classifier:
  category_model:
    model_id: "models/mom-domain-classifier"
    threshold: 0.6
    use_cpu: true
    category_mapping_path: "models/mom-domain-classifier/category_mapping.json"

# vLLM endpoint hosting your base model + LoRA adapters
vllm_endpoints:
  - name: "vllm-primary"
    address: "127.0.0.1"
    port: 8000
    weight: 1

# Define base model and available LoRA adapters
model_config:
  "llama2-7b":
    reasoning_family: "llama2"
    preferred_endpoints: ["vllm-primary"]
    # IMPORTANT: Define all available LoRA adapters here
    loras:
      - name: "technical-lora"
        description: "Optimized for programming and technical questions"
      - name: "medical-lora"
        description: "Specialized for medical and healthcare domain"
      - name: "legal-lora"
        description: "Fine-tuned for legal questions"

# Default model for fallback
default_model: "llama2-7b"

# Categories with LoRA routing
categories:
  - name: "technical"
    description: "Programming, software engineering, and technical questions"
    system_prompt: "You are an expert software engineer."
    model_scores:
      - model: "llama2-7b"             # Base model name
        lora_name: "technical-lora"    # LoRA adapter to use
        score: 1.0
        use_reasoning: true
        reasoning_effort: "medium"
  - name: "medical"
    description: "Medical and healthcare questions"
    system_prompt: "You are a medical expert."
    model_scores:
      - model: "llama2-7b"
        lora_name: "medical-lora"      # Different LoRA for medical
        score: 1.0
        use_reasoning: true
        reasoning_effort: "high"
  - name: "legal"
    description: "Legal questions and law-related topics"
    system_prompt: "You are a legal expert."
    model_scores:
      - model: "llama2-7b"
        lora_name: "legal-lora"        # Different LoRA for legal
        score: 1.0
        use_reasoning: true
        reasoning_effort: "high"
  - name: "general"
    description: "General questions"
    system_prompt: "You are a helpful assistant."
    model_scores:
      - model: "llama2-7b"             # No lora_name = uses base model
        score: 0.8
        use_reasoning: false
```
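A `lora_name` referenced by a category must match an adapter declared under `loras` (and registered with vLLM), or requests will carry an adapter name the backend doesn't know. A small sanity check you could run against the file above; this is a sketch, not part of the router:

```python
import yaml  # pip install pyyaml

# Hypothetical sanity check: every lora_name referenced by a category
# must be declared under its model's `loras` list.
with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)

declared = {
    lora["name"]
    for model in cfg.get("model_config", {}).values()
    for lora in model.get("loras", [])
}

for category in cfg.get("categories", []):
    for entry in category.get("model_scores", []):
        lora = entry.get("lora_name")
        if lora and lora not in declared:
            print(f"category {category['name']!r}: undeclared adapter {lora!r}")
```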
How It Works
LoRA routing combines intelligent classification with vLLM's LoRA adapter support:
Flow:
- User sends a query to the router (without specifying an adapter)
- Classification using any method (domain/embedding/keyword/MCP) detects intent
- A category is matched (e.g., the "technical" category)
- The router looks up `model_scores` for that category
- The LoRA adapter is selected via the `lora_name` field (e.g., "technical-lora")
- The request is forwarded to vLLM with `model="technical-lora"`
- vLLM applies the adapter and generates a response with domain-specific knowledge
Key insight: The classification method (domain/embedding/keyword/MCP) determines the category, and the category's `lora_name` determines which adapter to use.
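Concretely, the selected entry's `lora_name` replaces the base model name in the request that vLLM receives. A minimal sketch of that resolution step (illustrative only, not the router's actual code):

```python
# Minimal sketch of the resolution step. After classification returns a
# category, the best model_scores entry decides which model name vLLM sees.
def resolve_model(category: dict) -> str:
    best = max(category["model_scores"], key=lambda e: e["score"])
    # If the entry names a LoRA adapter, that name *is* the model vLLM
    # receives; otherwise fall back to the base model.
    return best.get("lora_name", best["model"])

technical = {
    "name": "technical",
    "model_scores": [
        {"model": "llama2-7b", "lora_name": "technical-lora", "score": 1.0},
    ],
}
print(resolve_model(technical))  # -> technical-lora
```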
Test Domain-Aware LoRA Routing
Send test queries and verify they're classified correctly:
```bash
# Technical query
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MoM", "messages": [{"role": "user", "content": "Explain async/await in JavaScript"}]}'

# Medical query
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MoM", "messages": [{"role": "user", "content": "What causes high blood pressure?"}]}'
```
Check the router logs to confirm the correct LoRA adapter is selected for each query.
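If you prefer to script the check, here is a small loop over test queries, assuming the response's `model` field echoes the adapter the router selected (if it doesn't in your build, rely on the logs as described above):

```python
import json
from urllib.request import Request, urlopen

# Send one query per domain through the router and print which model
# (adapter) name comes back. Assumes the response echoes the selected
# adapter in its `model` field.
ROUTER = "http://localhost:8080/v1/chat/completions"
queries = [
    "Explain async/await in JavaScript",  # expect technical-lora
    "What causes high blood pressure?",   # expect medical-lora
]

for q in queries:
    body = json.dumps({
        "model": "MoM",
        "messages": [{"role": "user", "content": q}],
    }).encode()
    req = Request(ROUTER, data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        print(q, "->", json.load(resp).get("model"))
```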
Real-World Use Cases
1. Healthcare Platform (Domain Routing + LoRA)
Problem: Medical queries need specialized adapters, but users don't know which to use.
Solution: Domain routing classifies queries into diagnosis/pharmacy/mental-health and routes each to the corresponding LoRA adapter.
Impact: Automatic adapter selection; 70GB of memory vs 210GB for 3 full models.
2. Legal Tech (Keyword Routing + LoRA for Compliance)
Problem: Compliance requires auditable routing to jurisdiction-specific legal adapters.
Solution: Keyword routing detects "US law"/"EU law"/"contract" keywords and routes to the compliant LoRA adapter.
Impact: 100% auditable routing decisions; 95% citation accuracy with specialized adapters.
3. Customer Support (Embedding Routing + LoRA)
Problem: Support queries span IT/HR/finance, and users phrase questions in many ways.
Solution: Embedding routing matches semantic intent and routes to department-specific LoRA adapters.
Impact: Handles paraphrases; a single endpoint serves all departments with <10ms adapter switching.
4. EdTech Platform (Domain Routing + LoRA)
Problem: Students ask math/science/literature questions and need subject-specific tutors.
Solution: Domain routing classifies the academic subject and routes to subject-specific LoRA adapters.
Impact: 4 specialized tutors for the cost of 1.2 base models; 70% cost savings.
5. Multi-Tenant SaaS (MCP Routing + LoRA)
Problem: Each tenant has custom LoRA adapters and needs dynamic routing based on tenant ID.
Solution: MCP routing queries the tenant database and returns the tenant-specific LoRA adapter name.
Impact: 1000+ tenants with custom adapters, private routing logic, and A/B testing support.
Next Steps
- See complete LoRA routing example
- Learn about decision configuration
- Read modular LoRA blog post for architecture details