Installation
This guide will help you install and run the vLLM Semantic Router. The router runs entirely on CPU and does not require a GPU for inference.
System Requirements
Note: No GPU required; the router runs efficiently on CPU using optimized BERT models.
Requirements:
- Python: 3.10 or higher
- Docker: Required for running the router container
- Optional: HuggingFace token (only for gated models)
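If you want to confirm these prerequisites before installing, a quick check could look like the following (version output will differ on your system):

```bash
# Confirm Python 3.10+ is available
python --version

# Confirm Docker is installed and the daemon is reachable
docker --version
docker info > /dev/null && echo "Docker daemon is running"
```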
Quick Start
1. Install vLLM Semantic Router
```bash
# Create a virtual environment (recommended)
python -m venv vsr
source vsr/bin/activate  # On Windows: vsr\Scripts\activate

# Install from PyPI
pip install vllm-sr
```
Verify installation:
```bash
vllm-sr --version
```
2. Initialize Configuration
```bash
# Create config.yaml in current directory
vllm-sr init
```
This creates a config.yaml file with default settings.
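You can inspect the generated defaults before making changes:

```bash
cat config.yaml
```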
3. Configure Your Backend
Edit the generated config.yaml to configure your model and backend endpoint:
```yaml
providers:
  # Model configuration
  models:
    - name: "qwen/qwen3-1.8b"            # Model name
      endpoints:
        - name: "my_vllm"
          weight: 1
          endpoint: "localhost:8000"     # Domain or IP:port
          protocol: "http"               # http or https
          access_key: "your-token-here"  # Optional: for authentication

  # Default model for fallback
  default_model: "qwen/qwen3-1.8b"
```
Configuration Options:
- `endpoint`: Domain name or IP address with port (e.g., localhost:8000, api.openai.com)
- `protocol`: http or https
- `access_key`: Optional authentication token (Bearer token)
- `weight`: Load balancing weight (default: 1)
Example: Local vLLM
```yaml
providers:
  models:
    - name: "qwen/qwen3-1.8b"
      endpoints:
        - name: "local_vllm"
          weight: 1
          endpoint: "localhost:8000"
          protocol: "http"
  default_model: "qwen/qwen3-1.8b"
```
Example: External API with HTTPS
```yaml
providers:
  models:
    - name: "openai/gpt-4"
      endpoints:
        - name: "openai_api"
          weight: 1
          endpoint: "api.openai.com"
          protocol: "https"
          access_key: "sk-xxxxxx"
  default_model: "openai/gpt-4"
```
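Example: Weighted endpoints (sketch)

The `weight` field controls how requests for a model are split across multiple endpoints. The sketch below is illustrative only; the endpoint names and addresses are placeholders, and it assumes weights are treated as relative shares (here roughly 2/3 vs. 1/3 of traffic):

```yaml
providers:
  models:
    - name: "qwen/qwen3-1.8b"
      endpoints:
        - name: "vllm_a"
          weight: 2                  # ~2/3 of requests, assuming relative weighting
          endpoint: "10.0.0.1:8000"
          protocol: "http"
        - name: "vllm_b"
          weight: 1                  # ~1/3 of requests
          endpoint: "10.0.0.2:8000"
          protocol: "http"
  default_model: "qwen/qwen3-1.8b"
```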
4. Start the Router
```bash
vllm-sr serve
```
The router will:
- Automatically download required ML models (~1.5GB, one-time)
- Start Envoy proxy on port 8888
- Start the semantic router service
- Enable metrics on port 9190
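Once it is running, you can do a quick smoke test of the metrics port. The /metrics path below follows the common Prometheus convention and is an assumption; adjust it if your deployment exposes a different path:

```bash
# Fetch the first few metric lines (Prometheus-style /metrics path assumed)
curl -s http://localhost:9190/metrics | head -n 20
```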
5. Test the Router
```bash
curl http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
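Since the router exposes an OpenAI-compatible endpoint, most OpenAI-style clients can be pointed at it by overriding the base URL. A minimal sketch using environment variables that recent official OpenAI SDKs read (the dummy key is a placeholder; the local example above does not send one):

```bash
# Point an OpenAI-compatible client at the router instead of the default API host
export OPENAI_BASE_URL=http://localhost:8888/v1
export OPENAI_API_KEY=dummy  # placeholder; some clients require a non-empty key
```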
Common Commands
```bash
# View logs
vllm-sr logs router     # Router logs
vllm-sr logs envoy      # Envoy logs
vllm-sr logs router -f  # Follow logs

# Check status
vllm-sr status

# Stop the router
vllm-sr stop
```
Advanced Configuration
HuggingFace Settings
Set environment variables before starting:
```bash
export HF_ENDPOINT=https://huggingface.co  # Or mirror: https://hf-mirror.com
export HF_TOKEN=your_token_here            # Only for gated models
export HF_HOME=/path/to/cache              # Custom cache directory

vllm-sr serve
```
Custom Options
```bash
# Use custom config file
vllm-sr serve --config my-config.yaml

# Use custom Docker image
vllm-sr serve --image ghcr.io/vllm-project/semantic-router/vllm-sr:latest

# Control image pull policy
vllm-sr serve --image-pull-policy always
```
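These flags can also be combined in a single invocation, for example (values are illustrative):

```bash
vllm-sr serve \
  --config my-config.yaml \
  --image ghcr.io/vllm-project/semantic-router/vllm-sr:latest \
  --image-pull-policy always
```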
Next Steps
- Configuration Guide - Advanced routing and signal configuration
- API Documentation - Complete API reference
- Tutorials - Learn by example
Getting Help
- Issues: GitHub Issues
- Community: Join the #semantic-router channel in vLLM Slack
- Documentation: vllm-semantic-router.com