# Distributed Tracing with OpenTelemetry
This guide explains how to configure and use distributed tracing in vLLM Semantic Router for enhanced observability and debugging capabilities.
## Overview
vLLM Semantic Router implements comprehensive distributed tracing using OpenTelemetry, providing fine-grained visibility into the request processing pipeline. Tracing helps you:
- Debug Production Issues: Trace individual requests through the entire routing pipeline
- Optimize Performance: Identify bottlenecks in classification, caching, and routing
- Monitor Security: Track PII detection and jailbreak prevention operations
- Analyze Decisions: Understand routing logic and reasoning mode selection
- Correlate Services: Connect traces across the router and vLLM backends
## Architecture

### Trace Hierarchy

A typical request trace follows this structure:

```text
semantic_router.request.received [root span]
├── semantic_router.classification
├── semantic_router.security.pii_detection
├── semantic_router.security.jailbreak_detection
├── semantic_router.cache.lookup
├── semantic_router.routing.decision
├── semantic_router.backend.selection
├── semantic_router.system_prompt.injection
└── semantic_router.upstream.request
```
### Span Attributes

Each span includes rich attributes following OpenInference conventions for LLM observability (a short sketch of setting these attributes follows the lists below):

Request Metadata:

- `request.id` - Unique request identifier
- `user.id` - User identifier (if available)
- `http.method` - HTTP method
- `http.path` - Request path

Model Information:

- `model.name` - Selected model name
- `routing.original_model` - Original requested model
- `routing.selected_model` - Model selected by router

Classification:

- `category.name` - Classified category
- `classifier.type` - Classifier implementation
- `classification.time_ms` - Classification duration

Security:

- `pii.detected` - Whether PII was found
- `pii.types` - Types of PII detected
- `jailbreak.detected` - Whether a jailbreak attempt was detected
- `security.action` - Action taken (blocked, allowed)

Routing:

- `routing.strategy` - Routing strategy (auto, specified)
- `routing.reason` - Reason for the routing decision
- `reasoning.enabled` - Whether reasoning mode is enabled
- `reasoning.effort` - Reasoning effort level

Performance:

- `cache.hit` - Cache hit/miss status
- `cache.lookup_time_ms` - Cache lookup duration
- `processing.time_ms` - Total processing time
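For reference, attributes like these can be recorded with the OpenTelemetry Go SDK as shown below. The tracer name, span name, and values are illustrative, not the router's actual code:

```go
// Sketch: recording some of the attributes listed above on a span.
// Keys mirror this page; the tracer/span names and values are illustrative.
package routing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func annotateSpan(ctx context.Context) {
	// Start a child span under whatever span is already in ctx.
	_, span := otel.Tracer("semantic_router").Start(ctx, "semantic_router.routing.decision")
	defer span.End()

	span.SetAttributes(
		attribute.String("request.id", "req-abc-123"),
		attribute.String("routing.original_model", "auto"),
		attribute.String("routing.selected_model", "gpt-4"),
		attribute.String("category.name", "math"),
		attribute.Bool("cache.hit", false),
		attribute.Int64("classification.time_ms", 45),
	)
}
```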
## Configuration

### Basic Configuration

Add the `observability.tracing` section to your `config.yaml`:

```yaml
observability:
  tracing:
    enabled: true
    provider: "opentelemetry"
    exporter:
      type: "stdout"  # or "otlp"
      endpoint: "localhost:4317"
      insecure: true
    sampling:
      type: "always_on"  # or "probabilistic"
      rate: 1.0
    resource:
      service_name: "vllm-semantic-router"
      service_version: "v0.1.0"
      deployment_environment: "production"
```
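If you are working on the router itself rather than just configuring it, the section above maps naturally onto a small set of configuration structs. The following Go sketch is hypothetical; the field and type names in the actual codebase may differ:

```go
// Hypothetical sketch of how the observability.tracing YAML could be modeled
// in Go; the router's real configuration types may differ.
package config

type ObservabilityConfig struct {
	Tracing TracingConfig `yaml:"tracing"`
}

type TracingConfig struct {
	Enabled  bool           `yaml:"enabled"`
	Provider string         `yaml:"provider"` // "opentelemetry"
	Exporter ExporterConfig `yaml:"exporter"`
	Sampling SamplingConfig `yaml:"sampling"`
	Resource ResourceConfig `yaml:"resource"`
}

type ExporterConfig struct {
	Type     string `yaml:"type"`     // "stdout" or "otlp"
	Endpoint string `yaml:"endpoint"` // OTLP gRPC endpoint, e.g. "localhost:4317"
	Insecure bool   `yaml:"insecure"`
}

type SamplingConfig struct {
	Type string  `yaml:"type"` // "always_on", "always_off", or "probabilistic"
	Rate float64 `yaml:"rate"` // only used for "probabilistic"
}

type ResourceConfig struct {
	ServiceName           string `yaml:"service_name"`
	ServiceVersion        string `yaml:"service_version"`
	DeploymentEnvironment string `yaml:"deployment_environment"`
}
```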
### Configuration Options

#### Exporter Types

`stdout` - Print traces to the console (development):

```yaml
exporter:
  type: "stdout"
```

`otlp` - Export to an OTLP-compatible backend (production):

```yaml
exporter:
  type: "otlp"
  endpoint: "jaeger:4317"  # Jaeger, Tempo, Datadog, etc.
  insecure: true           # Use false with TLS in production
```

#### Sampling Strategies

`always_on` - Sample all requests (development/debugging):

```yaml
sampling:
  type: "always_on"
```

`always_off` - Disable sampling (emergency performance):

```yaml
sampling:
  type: "always_off"
```

`probabilistic` - Sample a percentage of requests (production):

```yaml
sampling:
  type: "probabilistic"
  rate: 0.1  # Sample 10% of requests
```
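These exporter and sampling options correspond to standard constructs in the OpenTelemetry Go SDK. The sketch below shows how an `otlp` exporter with `probabilistic` sampling could be wired up; the endpoint, rate, and service name are taken from the examples on this page, and this is not the router's actual initialization code:

```go
// Sketch: mapping the "otlp" exporter and "probabilistic" sampling options
// onto the OpenTelemetry Go SDK. Illustrative only.
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func main() {
	ctx := context.Background()

	// exporter: type "otlp", endpoint "jaeger:4317", insecure: true
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("jaeger:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// resource: service_name "vllm-semantic-router"
	res, err := resource.Merge(resource.Default(),
		resource.NewWithAttributes(semconv.SchemaURL,
			semconv.ServiceNameKey.String("vllm-semantic-router"),
		))
	if err != nil {
		log.Fatal(err)
	}

	// sampling: type "probabilistic", rate 0.1. ParentBased keeps the
	// decision consistent when an upstream caller already sampled the trace.
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
		sdktrace.WithResource(res),
	)
	defer func() { _ = tp.Shutdown(ctx) }()

	otel.SetTracerProvider(tp)
}
```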
### Environment-Specific Configurations

#### Development

```yaml
observability:
  tracing:
    enabled: true
    provider: "opentelemetry"
    exporter:
      type: "stdout"
    sampling:
      type: "always_on"
    resource:
      service_name: "vllm-semantic-router-dev"
      deployment_environment: "development"
```

#### Production

```yaml
observability:
  tracing:
    enabled: true
    provider: "opentelemetry"
    exporter:
      type: "otlp"
      endpoint: "tempo:4317"
      insecure: false  # Use TLS
    sampling:
      type: "probabilistic"
      rate: 0.1  # 10% sampling
    resource:
      service_name: "vllm-semantic-router"
      service_version: "v0.1.0"
      deployment_environment: "production"
```
## Deployment

### With Jaeger

1. Start Jaeger (all-in-one for testing):

```bash
docker run -d --name jaeger \
  -p 4317:4317 \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest
```

2. Configure the router:

```yaml
observability:
  tracing:
    enabled: true
    exporter:
      type: "otlp"
      endpoint: "localhost:4317"
      insecure: true
    sampling:
      type: "probabilistic"
      rate: 0.1
```

3. Access the Jaeger UI: http://localhost:16686
### With Grafana Tempo

1. Configure Tempo (`tempo.yaml`):

```yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317

storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/traces
```

2. Start Tempo:

```bash
docker run -d --name tempo \
  -p 4317:4317 \
  -p 3200:3200 \
  -v $(pwd)/tempo.yaml:/etc/tempo.yaml \
  grafana/tempo:latest \
  -config.file=/etc/tempo.yaml
```

3. Configure the router:

```yaml
observability:
  tracing:
    enabled: true
    exporter:
      type: "otlp"
      endpoint: "tempo:4317"
      insecure: true
```
### Kubernetes Deployment

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: router-config
data:
  config.yaml: |
    observability:
      tracing:
        enabled: true
        exporter:
          type: "otlp"
          endpoint: "jaeger-collector.observability.svc:4317"
          insecure: false
        sampling:
          type: "probabilistic"
          rate: 0.1
        resource:
          service_name: "vllm-semantic-router"
          deployment_environment: "production"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: semantic-router
spec:
  template:
    spec:
      containers:
        - name: router
          image: vllm-semantic-router:latest
          env:
            - name: CONFIG_PATH
              value: /config/config.yaml
          volumeMounts:
            - name: config
              mountPath: /config
      volumes:
        - name: config
          configMap:
            name: router-config
```
## Usage Examples

### Viewing Traces

#### Console Output (stdout exporter)

```json
{
  "Name": "semantic_router.classification",
  "SpanContext": {
    "TraceID": "abc123...",
    "SpanID": "def456..."
  },
  "Attributes": [
    {
      "Key": "category.name",
      "Value": "math"
    },
    {
      "Key": "classification.time_ms",
      "Value": 45
    }
  ],
  "Duration": 45000000
}
```

#### Jaeger UI

1. Navigate to http://localhost:16686
2. Select service: `vllm-semantic-router`
3. Click "Find Traces"
4. View trace details and timeline
### Analyzing Performance

Find slow requests:

```text
Service: vllm-semantic-router
Min Duration: 1s
Limit: 20
```

Analyze classification bottlenecks:

```text
Filter by operation: semantic_router.classification
Sort by duration (descending)
```

Track cache effectiveness:

```text
Filter by tag: cache.hit = true
Compare durations with cache misses
```

### Debugging Issues

Find failed requests:

```text
Filter by tag: error = true
```

Trace a specific request:

```text
Filter by tag: request.id = req-abc-123
```

Find PII violations:

```text
Filter by tag: security.action = blocked
```
## Trace Context Propagation

The router automatically propagates trace context using W3C Trace Context headers:

Request headers (extracted by the router):

```text
traceparent: 00-abc123-def456-01
tracestate: vendor=value
```

Upstream headers (injected by the router):

```text
traceparent: 00-abc123-ghi789-01
x-vsr-destination-endpoint: endpoint1
x-selected-model: gpt-4
```

This enables end-to-end tracing from client → router → vLLM backend.
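In the OpenTelemetry Go SDK, this is the standard extract-then-inject pattern using the W3C `TraceContext` propagator. The sketch below is illustrative rather than the router's actual proxy code; the handler shape and upstream URL are assumptions:

```go
// Sketch: extract W3C trace context from the incoming request, start a span,
// and inject the new context into the upstream request. Illustrative only.
package main

import (
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

var propagator = propagation.NewCompositeTextMapPropagator(
	propagation.TraceContext{}, // traceparent / tracestate
	propagation.Baggage{},
)

func proxy(w http.ResponseWriter, r *http.Request) {
	// Continue the trace started by the client, if any.
	ctx := propagator.Extract(r.Context(), propagation.HeaderCarrier(r.Header))

	ctx, span := otel.Tracer("semantic_router").Start(ctx, "semantic_router.upstream.request")
	defer span.End()

	// Forward the span context to the vLLM backend (URL is an assumption).
	upstream, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"http://vllm-backend:8000/v1/chat/completions", r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	propagator.Inject(ctx, propagation.HeaderCarrier(upstream.Header))

	// ... send `upstream` with an http.Client and stream the response back ...
}
```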
## Performance Considerations

### Overhead

Tracing adds minimal overhead when properly configured:

- Always-on sampling: ~1-2% latency increase
- 10% probabilistic sampling: ~0.1-0.2% latency increase
- Async export: no blocking on span export

### Optimization Tips

1. Use probabilistic sampling in production:

```yaml
sampling:
  type: "probabilistic"
  rate: 0.1  # Adjust based on traffic
```

2. Adjust the sampling rate to your traffic volume:
   - High traffic: 0.01-0.1 (1-10%)
   - Medium traffic: 0.1-0.5 (10-50%)
   - Low traffic: 0.5-1.0 (50-100%)

3. Use batch exporters (the default): spans are batched before export, which reduces network overhead. A sketch of the relevant batcher options follows this list.

4. Monitor exporter health: watch for export failures in the logs and configure retry policies.
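The batching behavior referenced above is controlled by the SDK's batch span processor. A sketch of the knobs you might tune, assuming an OTLP gRPC exporter as in the earlier examples; the values shown are illustrative, not project recommendations:

```go
// Sketch: tuning the batch span processor behind "batch exporters".
// The numbers are illustrative; adjust them to your traffic and memory budget.
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func newTracerProvider(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exp, err := otlptracegrpc.New(ctx, otlptracegrpc.WithInsecure())
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp,
			sdktrace.WithBatchTimeout(5*time.Second), // how often buffered spans are flushed
			sdktrace.WithMaxExportBatchSize(512),     // spans sent per export call
			sdktrace.WithMaxQueueSize(2048),          // bound on buffered spans (memory)
		),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
	)
	return tp, nil
}
```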
## Troubleshooting

### Traces Not Appearing

1. Check that tracing is enabled:

```yaml
observability:
  tracing:
    enabled: true
```

2. Verify the exporter endpoint:

```bash
# Test OTLP endpoint connectivity
telnet jaeger 4317
```

3. Check the logs for export errors, for example:

```text
Failed to export spans: connection refused
```

### Missing Spans

1. Check the sampling rate:

```yaml
sampling:
  type: "probabilistic"
  rate: 1.0  # Increase to see more traces
```

2. Verify span creation in code (see the sketch below):
   - Spans are created at key processing points
   - Check for nil context
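A minimal sketch of what correct span creation at a processing point looks like, including the nil-context guard; the function, tracer, and span names are illustrative:

```go
// Sketch: creating a span at a processing point and guarding against a nil
// context. Function and span names are illustrative.
package routing

import (
	"context"

	"go.opentelemetry.io/otel"
)

func classify(ctx context.Context, query string) string {
	// Guard against a nil context: starting a span from one can panic or
	// orphan the span. Fall back to context.Background() so the span is
	// still recorded (it starts a new trace instead of joining a parent).
	if ctx == nil {
		ctx = context.Background()
	}

	_, span := otel.Tracer("semantic_router").Start(ctx, "semantic_router.classification")
	defer span.End()

	// ... run the classifier and return the detected category ...
	return "math"
}
```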
### High Memory Usage

1. Reduce the sampling rate:

```yaml
sampling:
  rate: 0.01  # 1% sampling
```

2. Verify the batch exporter is working:
   - Check the export interval
   - Monitor the queue length
## Best Practices

1. Start with stdout in development
   - Easy to verify tracing works
   - No external dependencies

2. Use probabilistic sampling in production
   - Balances visibility and performance
   - Start with 10% and adjust

3. Set meaningful service names
   - Use environment-specific names
   - Include version information

4. Add custom attributes for your use case
   - Customer IDs
   - Deployment region
   - Feature flags

5. Monitor exporter health
   - Track export success rate
   - Alert on high failure rates

6. Correlate with metrics
   - Use the same service name
   - Cross-reference trace IDs in logs (see the sketch after this list)
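One straightforward way to cross-reference trace IDs in logs is to read the ID from the active span context when emitting a log line. A sketch using the standard library's `log/slog`; the logger choice and field names are assumptions:

```go
// Sketch: attaching the current trace and span IDs to log lines so they can
// be cross-referenced with traces. Logger choice and field names are assumptions.
package logging

import (
	"context"
	"log/slog"

	"go.opentelemetry.io/otel/trace"
)

func logWithTrace(ctx context.Context, msg string) {
	sc := trace.SpanContextFromContext(ctx)
	if !sc.HasTraceID() {
		slog.InfoContext(ctx, msg)
		return
	}
	slog.InfoContext(ctx, msg,
		"trace_id", sc.TraceID().String(),
		"span_id", sc.SpanID().String(),
	)
}
```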
## Integration with vLLM Stack

### Future Enhancements

The tracing implementation is designed to support future integration with vLLM backends:

- Trace context propagation to vLLM
- Correlated spans across router and engine
- End-to-end latency analysis
- Token-level timing from vLLM

Stay tuned for updates on vLLM integration!