OSS Temporal Service metrics reference
The information on this page is relevant to open source Temporal Service deployments.
See Cloud metrics for metrics emitted by Temporal Cloud.
See SDK metrics for metrics emitted by the SDKs.
A Temporal Service emits a range of metrics to help operators get visibility into the Temporal Service's performance and to set up alerts.
All metrics emitted by the Temporal Service are listed in metric_defs.go.
For details on setting up metrics in your Temporal Service configuration, see the Temporal Service configuration reference.
The dashboards repository contains community-driven Grafana dashboard templates that you can use as a starting point for monitoring Temporal Service and SDK metrics. Use these templates as references to build your own dashboards. For any metrics that are missing from the dashboards, use metric_defs.go as a reference.
In addition to the metrics emitted by the Temporal Service, you should also monitor infrastructure-specific metrics such as CPU, memory, and network for all hosts that run the Temporal Service's services.
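For example, if your hosts expose machine metrics through the Prometheus node_exporter (an assumption; your infrastructure may use a different exporter and metric names), the following query shows approximate CPU utilization per host:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)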
Common metrics
Temporal emits metrics for each gRPC service request.
These metrics are emitted with type, operation, and namespace tags, which provide visibility into Service usage and show the request rates across Services, Namespaces, and Operations.
- Use the operation tag in your query to get request rates, error rates, or latencies per operation.
- Use the service_name tag with the service role tag values to get details for the specific service.
All common tags that you can add in your query are defined in the metric_defs.go file.
For example, to see service requests by operation on the Frontend Service, use the following:
sum by (operation) (rate(service_requests{service_name="frontend"}[2m]))
Note: All metrics queries in this topic are Prometheus queries.
The following list describes some metrics you can get started with.
service_requests
Shows service requests received per Task Queue.
Example: Service requests by operation
sum(rate(service_requests{operation="AddWorkflowTask"}[2m]))
service_latency
Shows latencies for all Client request operations.
Usually these are the starting point to investigate which operation is experiencing high-latency issues.
Example: P95 service latency by operation for the Frontend Service
histogram_quantile(0.95, sum(rate(service_latency_bucket{service_name="frontend"}[5m])) by (operation, le))
service_error_with_type
(Available only in v1.17.0+) Identifies errors encountered by the service.
Example: Service errors by type for the Frontend Service
sum(rate(service_error_with_type{service_name="frontend"}[5m])) by (error_type)
client_errors
An indicator for connection issues between different Server roles.
Example: Client errors
sum(rate(client_errors{service_name="frontend",service_role="history"}[5m]))
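To narrow down which service-to-service path is producing errors, you can group the same metric by its tags. The following is a sketch that assumes the service_name and service_role tags shown above identify the calling service and the target role, respectively:
sum by (service_name, service_role) (rate(client_errors[5m]))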
In addition to these, you can monitor service-specific metrics to get performance details for each service. Start with the following list, and use metric_defs.go to look up additional metrics as required.
Matching Service metrics
poll_success
Shows the number of Tasks that are successfully matched to a poller.
Example: sum(rate(poll_success{}[5m]))
poll_timeouts
Shows when no Tasks are available for the poller within the poll timeout.
Example: sum(rate(poll_timeouts{}[5m]))
asyncmatch_latency
Measures the time from creation to delivery for async matched Tasks.
The larger this latency, the longer Tasks are sitting in the queue waiting for your Workers to pick them up.
Example: histogram_quantile(0.95, sum(rate(asyncmatch_latency_bucket{service_name=~"matching"}[5m])) by (operation, le))
no_poller_tasks
A counter metric that is emitted whenever a Task is added to a Task Queue that has no poller. This is usually an indicator that either the Worker or the starter programs are using the wrong Task Queue.
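To find which Task Queues are affected, you can group this metric by its Task Queue tag. The following is a sketch that assumes the tag is named taskqueue (confirm the exact tag name in metric_defs.go):
Example: sum by (taskqueue) (rate(no_poller_tasks[5m]))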