
Installation

This guide walks you through installing and running the vLLM Semantic Router. The router runs entirely on CPU and does not require a GPU for inference.

System Requirements

Note: No GPU is required; the router runs efficiently on CPU using optimized BERT models.

Requirements:

  • Python: 3.10 or higher
  • Docker: Required for running the router container
  • Optional: HuggingFace token (only for gated models)
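
To confirm the prerequisites are in place, you can check each one from a shell before installing:

python --version    # should report 3.10 or higher
docker --version    # confirms the Docker CLI is installed
docker info         # confirms the Docker daemon is running (may require sudo)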

Quick Start

1. Install vLLM Semantic Router

# Create a virtual environment (recommended)
python -m venv vsr
source vsr/bin/activate # On Windows: vsr\Scripts\activate

# Install from PyPI
pip install vllm-sr

Verify installation:

vllm-sr --version

2. Initialize Configuration

# Create config.yaml in current directory
vllm-sr init

This creates a config.yaml file with default settings.

3. Configure Your Backend

Edit the generated config.yaml to configure your model and backend endpoint:

providers:
  # Model configuration
  models:
    - name: "qwen/qwen3-1.8b"           # Model name
      endpoints:
        - name: "my_vllm"
          weight: 1
          endpoint: "localhost:8000"    # Domain or IP:port
          protocol: "http"              # http or https
          access_key: "your-token-here" # Optional: for authentication

  # Default model for fallback
  default_model: "qwen/qwen3-1.8b"

Configuration Options:

  • endpoint: Domain name or IP address with port (e.g., localhost:8000, api.openai.com)
  • protocol: http or https
  • access_key: Optional authentication token (Bearer token)
  • weight: Load balancing weight (default: 1)

Example: Local vLLM

providers:
  models:
    - name: "qwen/qwen3-1.8b"
      endpoints:
        - name: "local_vllm"
          weight: 1
          endpoint: "localhost:8000"
          protocol: "http"
  default_model: "qwen/qwen3-1.8b"

Example: External API with HTTPS

providers:
  models:
    - name: "openai/gpt-4"
      endpoints:
        - name: "openai_api"
          weight: 1
          endpoint: "api.openai.com"
          protocol: "https"
          access_key: "sk-xxxxxx"
  default_model: "openai/gpt-4"
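
Example (sketch): Load balancing across replicas. The endpoint names and the second port below are illustrative, not from the default config; per the weight option described above, requests should be distributed in proportion to the weights:

providers:
  models:
    - name: "qwen/qwen3-1.8b"
      endpoints:
        - name: "replica_a"          # illustrative name
          weight: 2                  # receives roughly twice the traffic of replica_b
          endpoint: "localhost:8000"
          protocol: "http"
        - name: "replica_b"          # illustrative name
          weight: 1
          endpoint: "localhost:8001" # illustrative second replica
          protocol: "http"
  default_model: "qwen/qwen3-1.8b"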

4. Start the Router

vllm-sr serve

The router will:

  • Automatically download required ML models (~1.5GB, one-time)
  • Start Envoy proxy on port 8888
  • Start the semantic router service
  • Enable metrics on port 9190
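
To verify the services came up, you can probe the two ports. The /metrics path below is an assumption (the conventional Prometheus endpoint), not something this guide specifies:

curl -s http://localhost:9190/metrics | head -n 5   # assumed Prometheus-style /metrics path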

5. Test the Router

curl http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
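
Since the test above uses the OpenAI-style /v1/chat/completions endpoint, you can pull just the reply text out of the JSON with jq; the field path below assumes the standard OpenAI response schema:

curl -s http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MoM", "messages": [{"role": "user", "content": "Hello!"}]}' \
  | jq -r '.choices[0].message.content'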

Common Commands

# View logs
vllm-sr logs router # Router logs
vllm-sr logs envoy # Envoy logs
vllm-sr logs router -f # Follow logs

# Check status
vllm-sr status

# Stop the router
vllm-sr stop

Advanced Configuration

HuggingFace Settings

Set environment variables before starting:

export HF_ENDPOINT=https://huggingface.co  # Or mirror: https://hf-mirror.com
export HF_TOKEN=your_token_here # Only for gated models
export HF_HOME=/path/to/cache # Custom cache directory

vllm-sr serve

Custom Options

# Use custom config file
vllm-sr serve --config my-config.yaml

# Use custom Docker image
vllm-sr serve --image ghcr.io/vllm-project/semantic-router/vllm-sr:latest

# Control image pull policy
vllm-sr serve --image-pull-policy always
