Observability

The LLM Service Daemon (LSD) writes every inference, every provider response, and every feedback signal to the same PostgreSQL database that holds its config. There is no separate analytics store to run or keep in sync.

| Table family | Contents |
| --- | --- |
| chat_inferences, json_inferences, embedding_inferences | One row per inference, by output type (partitioned by time) |
| chat_inference_data, json_inference_data, model_inference_data | Raw request/response payloads backing the above |
| model_inferences | Per-provider-attempt telemetry (latency, tokens, which provider actually served the request) |
| batch_requests, batch_model_inferences | Batch inference jobs and their results |
| boolean_metric_feedback, float_metric_feedback, comment_feedback, demonstration_feedback | Feedback attached to an inference or episode |
| inference_evaluation_runs, inference_evaluation_human_feedback | Evaluation run results and any human feedback collected for them |

Materialized aggregates are refreshed automatically for dashboards and cost tracking: inference_by_function_statistics, variant_statistics, model_provider_statistics, and per-minute/per-hour model_latency_histogram_*.

Postgres is a hard dependency, not an optional sink. Config, auth, rate limiting, and observability all share one connection, so LSD_DATABASE_URL must point at a reachable database or the gateway refuses to start.
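Because the gateway will not start without a reachable database, it can help to check the URL before launching it. The following is a minimal preflight sketch, not part of LSD itself. It assumes LSD_DATABASE_URL is a standard postgres:// or postgresql:// URL, and the function names are hypothetical. It only tests TCP reachability, not credentials or schema.

```python
import socket
from urllib.parse import urlsplit


def parse_database_url(url: str) -> tuple[str, int]:
    """Return (host, port) from a postgres:// or postgresql:// URL."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("database URL has no host")
    # 5432 is the standard PostgreSQL port when the URL omits one.
    return parts.hostname, parts.port or 5432


def database_reachable(url: str, timeout: float = 2.0) -> bool:
    """True if a TCP connection to the database host succeeds.

    This does not authenticate; it only shows that something is listening.
    """
    host, port = parse_database_url(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A deploy script could call database_reachable(os.environ["LSD_DATABASE_URL"]) and fail early with a clear message instead of waiting for the gateway to exit.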

What you can toggle is whether observability rows are written over that already-required connection:

[gateway.observability]
enabled = true # write inference/feedback rows (default: true)
async_writes = true # don't block the response on the write (default in production)
[gateway.observability.batch_writes]
enabled = true
flush_interval_ms = 100
max_rows = 1000

Set enabled = false to skip writing observability data entirely, for example in a load-testing setup where you want Postgres for config/auth but don’t want every inference recorded.

Async writes avoid adding write latency to the request path; batch writes coalesce many rows into fewer Postgres round-trips under load.
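To make the batching idea concrete, here is a toy sketch of a buffer that flushes when it reaches a row limit or when its oldest row has waited for the flush interval, whichever comes first. It illustrates the general technique only and is not LSD's implementation. The BatchWriter class and its methods are hypothetical, and reading max_rows and flush_interval_ms as these two thresholds is an assumption.

```python
from typing import Callable, List


class BatchWriter:
    """Toy row buffer that flushes on size or age, whichever comes first."""

    def __init__(self, flush: Callable[[List[dict]], None],
                 max_rows: int = 1000, flush_interval_ms: int = 100) -> None:
        self._flush = flush              # e.g. one multi-row INSERT
        self._max_rows = max_rows
        self._interval_ms = flush_interval_ms
        self._rows: List[dict] = []
        self._oldest_ms = 0.0            # arrival time of the oldest buffered row

    def add(self, row: dict, now_ms: float) -> None:
        """Buffer a row; flush immediately once the buffer is full."""
        if not self._rows:
            self._oldest_ms = now_ms
        self._rows.append(row)
        if len(self._rows) >= self._max_rows:
            self.flush()

    def tick(self, now_ms: float) -> None:
        """Call periodically; flushes if the oldest row has waited long enough."""
        if self._rows and now_ms - self._oldest_ms >= self._interval_ms:
            self.flush()

    def flush(self) -> None:
        """Hand all buffered rows to the flush callback as a single batch."""
        if self._rows:
            rows, self._rows = self._rows, []
            self._flush(rows)
```

The clock is passed in explicitly so the behaviour is easy to test. A real writer would drive tick from a timer and run flush off the request path, which is what makes the writes asynchronous.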

curl -X POST http://localhost:3000/v1/inferences/list_inferences \
-H "Content-Type: application/json" \
-d '{"function_name": "my_function", "limit": 100}'

POST /v1/inferences/list_inferences and POST /v1/inferences/get_inferences let you query stored inferences programmatically, so you don't need to hand-write SQL against the partitioned tables, though you can.
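The curl call above can also be made from Python using only the standard library. This is a sketch: the helper names are hypothetical, the request body mirrors the two fields shown in the curl example, and the response is returned as decoded JSON because its shape is not described here.

```python
import json
import urllib.request


def build_list_inferences_request(base_url: str, function_name: str,
                                  limit: int = 100) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"function_name": function_name, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/inferences/list_inferences",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def list_inferences(base_url: str, function_name: str, limit: int = 100):
    """Send the request and return the decoded JSON response as-is."""
    req = build_list_inferences_request(base_url, function_name, limit)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

For example, list_inferences("http://localhost:3000", "my_function") issues the same request as the curl command.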

  • OTLP traces: enable with gateway.export.otlp.traces.enabled = true and point OTEL_EXPORTER_OTLP_TRACES_ENDPOINT at your collector. Spans are created per inference, batch, and feedback request.
  • Prometheus: scrape GET /metrics for request counts, latency histograms, and per-provider stats.
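If you want to inspect the /metrics output without running a Prometheus server, the text exposition format is simple enough to parse by hand. The sketch below handles plain sample lines only. The parse_metrics helper is hypothetical, and it does not unescape label values.

```python
import re
from typing import Dict, List, Tuple

# name, optional {labels}, value, optional integer timestamp
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> List[Tuple[str, Dict[str, str], float]]:
    """Parse Prometheus text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        m = _SAMPLE.match(line)
        if m is None:
            continue  # ignore anything this simple parser does not understand
        name, labels, value = m.groups()
        samples.append((name, dict(_LABEL.findall(labels or "")), float(value)))
    return samples
```

Feed it the body of a GET /metrics response. The metric and label names you get back are whatever the gateway exposes; none are assumed here.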