# OpenTelemetry LLM Observability

This document provides comprehensive guidance for using the OpenTelemetry LLM observability tool with the MCP server. It covers setup, configuration, usage, troubleshooting, and practical examples for all major OpenTelemetry-compatible backends.

## Features

- **Universal Compatibility**: Works with Jaeger, New Relic, Grafana, Datadog, Honeycomb, and more
- **Comprehensive Metrics**: Request counts, token usage, latency, error rates
- **Distributed Tracing**: Full request lifecycle tracking with spans
- **Flexible Configuration**: Environment-based configuration for different backends
- **Zero-Code Integration**: Drop-in replacement for existing observability tools

---

## Quick Start

### 1. Install Dependencies

OpenTelemetry dependencies are included in `package.json`:

```bash
npm install
```

### 2. Configure Your Backend

#### Jaeger (Local Development)

```bash
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_SERVICE_NAME=llm-observability-mcp
```

#### New Relic

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp.nr-data.net:4318
export OTEL_EXPORTER_OTLP_HEADERS="api-key=YOUR_NEW_RELIC_LICENSE_KEY"
export OTEL_SERVICE_NAME=llm-observability-mcp
```

#### Grafana Cloud

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod-us-central-0.grafana.net/otlp
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic $(echo -n YOUR_INSTANCE_ID:YOUR_API_KEY | base64)"
export OTEL_SERVICE_NAME=llm-observability-mcp
```

#### Honeycomb

```bash
# Honeycomb's OTLP endpoint is the API host itself; the exporter appends
# the signal paths (/v1/traces, /v1/metrics) automatically.
export OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io
export OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=YOUR_API_KEY"
export OTEL_SERVICE_NAME=llm-observability-mcp
```

#### Datadog

```bash
# Datadog ingests OTLP through the Datadog Agent (with OTLP ingestion
# enabled, listening on port 4318), not the public REST API. The Agent is
# configured with your API key, so point the exporter at the Agent:
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_SERVICE_NAME=llm-observability-mcp
```

#### Lightstep

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=https://ingest.lightstep.com:443
export OTEL_EXPORTER_OTLP_HEADERS="lightstep-access-token=YOUR_ACCESS_TOKEN"
export OTEL_SERVICE_NAME=llm-observability-mcp
```

#### Kubernetes Example

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-observability-mcp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-observability-mcp
  template:
    metadata:
      labels:
        app: llm-observability-mcp
    spec:
      containers:
        - name: llm-observability-mcp
          image: llm-observability-mcp:latest
          ports:
            - containerPort: 3000
          env:
            - name: OTEL_SERVICE_NAME
              value: "llm-observability-mcp"
            - name: OTEL_SERVICE_VERSION
              value: "1.2.3"
            - name: OTEL_ENVIRONMENT
              value: "production"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "https://your-backend.com:4318"
            - name: OTEL_EXPORTER_OTLP_HEADERS
              valueFrom:
                secretKeyRef:
                  name: otel-credentials
                  key: headers
```

---

## Running the MCP Server

```bash
# Start with stdio transport
npm run mcp:stdio

# Start with HTTP transport
npm run mcp:http
```

---

## Usage

### OpenTelemetry Tool: `llm_observability_otel`

#### Required Parameters

- `userId`: The distinct ID of the user
- `model`: The model used (e.g., "gpt-4", "claude-3")
- `provider`: The LLM provider (e.g., "openai", "anthropic")

#### Optional Parameters

- `traceId`: Trace ID for grouping related events
- `input`: The input to the LLM (messages, prompt, etc.)
- `outputChoices`: The output from the LLM
- `inputTokens`: Number of tokens in the input
- `outputTokens`: Number of tokens in the output
- `latency`: Latency of the LLM call in seconds
- `httpStatus`: HTTP status code of the LLM call
- `baseUrl`: Base URL of the LLM API
- `operationName`: Name of the operation being performed
- `error`: Error message if the request failed
- `errorType`: Type of error (e.g., "rate_limit", "timeout")
- `mcpToolsUsed`: List of MCP tools used during the request

#### Example Usage

```json
{
  "tool": "llm_observability_otel",
  "arguments": {
    "userId": "user-12345",
    "model": "gpt-4",
    "provider": "openai",
    "inputTokens": 150,
    "outputTokens": 75,
    "latency": 2.3,
    "httpStatus": 200,
    "operationName": "chat-completion",
    "traceId": "trace-abc123",
    "input": "What is the weather like today?",
    "outputChoices": ["The weather is sunny and 75°F today."]
  }
}
```

---

## Configuration Reference

| Variable | Description | Default |
|----------|-------------|---------|
| `OTEL_SERVICE_NAME` | Service name for OpenTelemetry | `llm-observability-mcp` |
| `OTEL_SERVICE_VERSION` | Service version | `1.0.0` |
| `OTEL_ENVIRONMENT` | Environment name | `development` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Default OTLP endpoint | - |
| `OTEL_EXPORTER_OTLP_METRICS_ENDPOINT` | Metrics endpoint | - |
| `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` | Traces endpoint | - |
| `OTEL_EXPORTER_OTLP_LOGS_ENDPOINT` | Logs endpoint | - |
| `OTEL_EXPORTER_OTLP_HEADERS` | Headers for authentication (format: `key1=value1,key2=value2`) | - |
| `OTEL_METRIC_EXPORT_INTERVAL` | Metrics export interval in ms | `10000` |
| `OTEL_METRIC_EXPORT_TIMEOUT` | Metrics export timeout in ms | `5000` |
| `OTEL_TRACES_SAMPLER_ARG` | Sampling ratio (0.0-1.0) | `1.0` |

---

## Metrics Collected

- `llm.requests.total`: Total number of LLM requests
- `llm.tokens.total`: Total tokens used (input + output)
- `llm.latency.duration`: Request latency in milliseconds
- `llm.requests.active`: Number of active requests
### Trace Attributes

- `llm.model`, `llm.provider`, `llm.user_id`, `llm.operation`
- `llm.input_tokens`, `llm.output_tokens`, `llm.total_tokens`
- `llm.latency_ms`, `llm.http_status`, `llm.base_url`
- `llm.error`, `llm.error_type`
- `llm.input`, `llm.output`, `llm.mcp_tools_used`

---

## Practical Examples

### Jaeger: View Traces

Open http://localhost:16686 in your browser to see your traces.

### Error Tracking Example

```json
{
  "tool": "llm_observability_otel",
  "arguments": {
    "userId": "user-12345",
    "model": "gpt-4",
    "provider": "openai",
    "httpStatus": 429,
    "error": "Rate limit exceeded",
    "errorType": "rate_limit",
    "latency": 0.1,
    "operationName": "chat-completion"
  }
}
```

### Multi-Tool Usage Tracking Example

```json
{
  "tool": "llm_observability_otel",
  "arguments": {
    "userId": "user-12345",
    "model": "gpt-4",
    "provider": "openai",
    "inputTokens": 500,
    "outputTokens": 200,
    "latency": 5.2,
    "httpStatus": 200,
    "operationName": "complex-workflow",
    "mcpToolsUsed": ["file_read", "web_search", "code_execution"],
    "traceId": "complex-workflow-123"
  }
}
```

### Testing Script

```bash
#!/bin/bash
# test-opentelemetry.sh

docker run -d --name jaeger-test \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

sleep 5

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_SERVICE_NAME=llm-observability-test
export OTEL_ENVIRONMENT=test

# Use the HTTP transport so the tool can be invoked with curl below.
npm run mcp:http &
sleep 3

curl -X POST http://localhost:3000/mcp \
  -H "Content-Type: application/json" \
  -d '{
    "tool": "llm_observability_otel",
    "arguments": {
      "userId": "test-user",
      "model": "gpt-4",
      "provider": "openai",
      "inputTokens": 100,
      "outputTokens": 50,
      "latency": 1.5,
      "httpStatus": 200,
      "operationName": "test-completion"
    }
  }'

echo "Test complete. View traces at http://localhost:16686"
```

---

## Migration from PostHog

The OpenTelemetry tool is a drop-in replacement for the PostHog tool.
Both can coexist for gradual migration:

- **PostHog Tool**: `llm_observability_posthog`
- **OpenTelemetry Tool**: `llm_observability_otel`

Both accept the same parameters.

---

## Troubleshooting & Performance

### Common Issues

- **No data in backend**: check endpoint URLs, authentication headers, network connectivity, and server logs
- **High resource usage**: lower sampling (`OTEL_TRACES_SAMPLER_ARG`), increase export intervals
- **Missing traces**: verify OpenTelemetry is enabled, and check the logs and service name

### Debug Mode

```bash
export DEBUG=true
npm run mcp:stdio
```

### Performance Tuning

- Reduce sampling for high-volume workloads: `OTEL_TRACES_SAMPLER_ARG=0.01`
- Increase export intervals: `OTEL_METRIC_EXPORT_INTERVAL=60000`
- Disable metrics/logs if not needed: `unset OTEL_EXPORTER_OTLP_METRICS_ENDPOINT`, `unset OTEL_EXPORTER_OTLP_LOGS_ENDPOINT`

---

## Support

For issues or questions:

1. Check this document and the troubleshooting section above
2. Review server logs with `DEBUG=true`
3. Verify your OpenTelemetry configuration
4. Test with Jaeger locally first
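A frequent cause of the "no data in backend" symptom is a malformed `OTEL_EXPORTER_OTLP_HEADERS` value, since exporters tend to drop unparseable headers silently. The following is a hypothetical debugging helper (`parseOtlpHeaders` is not part of the server) that validates the documented `key1=value1,key2=value2` format so a typo fails loudly:

```typescript
// Hypothetical helper, not part of the MCP server: parse the documented
// "key1=value1,key2=value2" OTEL_EXPORTER_OTLP_HEADERS format, throwing
// on malformed pairs instead of silently dropping them.
function parseOtlpHeaders(raw: string): Record<string, string> {
  const headers: Record<string, string> = {};
  for (const pair of raw.split(",")) {
    const eq = pair.indexOf("=");
    if (eq <= 0) {
      throw new Error(`Malformed header pair: "${pair}" (expected key=value)`);
    }
    // Values may themselves contain "=" (e.g. base64 padding in the
    // Grafana Cloud example above), so split only on the first one.
    headers[pair.slice(0, eq).trim()] = pair.slice(eq + 1).trim();
  }
  return headers;
}

console.log(parseOtlpHeaders("api-key=ABC123,x-honeycomb-team=KEY=="));
// → { 'api-key': 'ABC123', 'x-honeycomb-team': 'KEY==' }
```

Running this against the exact string you export is a quicker check than waiting for a trace that never arrives.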