Trace Correlation
How BlameTrail automatically correlates distributed traces with incidents and suspect deploys.
When an incident has a suspect deploy, BlameTrail automatically queries your connected trace providers to find traces that may be related to the failure. Traces are scored by relevance and presented alongside a span waterfall view and latency regression analysis.
How it works
- An incident fires and suspect scoring identifies a suspect deploy.
- BlameTrail computes a time window that covers a period before and after the deploy.
- Each connected trace provider is queried in parallel for traces within that window.
- Traces are deduplicated, scored, and ranked by relevance.
- Results are displayed on the incident detail page in the Trace Exploration panel.
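The windowing, parallel query, and deduplication steps above can be sketched roughly as follows. This is an illustrative sketch, not BlameTrail's implementation: the window sizes, the `deploy_window` and `collect_traces` names, the provider call signature, and the `trace_id` key are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta
from typing import Callable

# Hypothetical provider interface: a callable that takes (start, end)
# and returns a list of trace dicts, each with a "trace_id" key.
Provider = Callable[[datetime, datetime], list[dict]]


def deploy_window(
    deploy_time: datetime,
    before: timedelta = timedelta(minutes=15),  # assumed size, not documented
    after: timedelta = timedelta(minutes=30),   # assumed size, not documented
) -> tuple[datetime, datetime]:
    """Return the (start, end) of a window around the deploy."""
    return deploy_time - before, deploy_time + after


def collect_traces(
    providers: list[Provider], start: datetime, end: datetime
) -> list[dict]:
    """Query every provider concurrently, then drop duplicate trace IDs."""
    if not providers:
        return []
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        # pool.map preserves provider order, so deduplication is deterministic.
        batches = list(pool.map(lambda query: query(start, end), providers))
    unique: dict[str, dict] = {}
    for batch in batches:
        for trace in batch:
            # Keep the first copy seen for each trace ID.
            unique.setdefault(trace["trace_id"], trace)
    return list(unique.values())
```

In this sketch, a trace reported by two providers is kept once, using the copy from whichever provider appears first in the list. Scoring and ranking would then run over the deduplicated result.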
Supported providers
Any of the following trace-capable providers can be used:
- Tempo — Grafana Tempo HTTP API
- Jaeger — Jaeger Query API
- Datadog — Datadog Trace Search API
- Honeycomb — Honeycomb Query API
- New Relic — NRQL-based trace queries
- Elastic APM — Elasticsearch APM index queries
- AWS X-Ray — X-Ray GetTraceSummaries / BatchGetTraces
- Lightstep — Lightstep Snapshot and Stored Traces API
Relevance scoring
Each trace receives a relevance score from 0 to 100 based on four weighted factors, whose maximum points add up to 100:
| Factor | Max points | Description |
|---|---|---|
| Error density | 40 | Proportion of spans with errors in the trace |
| Temporal proximity | 25 | How close the trace start time is to the deploy time |
| Latency deviation | 20 | How much the trace duration deviates from the median |
| Service relevance | 15 | Whether the trace includes spans from the incident's service |
Traces with higher scores appear first. Click a trace to expand its span waterfall view, which shows the full request flow across services.
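To make the weighting concrete, here is a minimal sketch of how a score could be assembled from the four factors in the table. Only the maximum points come from the table; how each factor is normalized to the 0 to 1 range, the `relevance_score` function, and its parameters are assumptions for illustration and may not match what BlameTrail actually computes.

```python
# Maximum points per factor, taken from the table above.
MAX_POINTS = {
    "error_density": 40,
    "temporal_proximity": 25,
    "latency_deviation": 20,
    "service_relevance": 15,
}


def clamp01(value: float) -> float:
    """Limit a value to the range [0, 1]."""
    return max(0.0, min(1.0, value))


def relevance_score(
    error_spans: int,
    total_spans: int,
    seconds_from_deploy: float,
    window_seconds: float,
    duration_ms: float,
    median_ms: float,
    includes_incident_service: bool,
) -> float:
    """Combine four normalized factors into a 0-100 score (assumed formula)."""
    # Share of spans in the trace that recorded an error.
    error_density = error_spans / total_spans if total_spans else 0.0
    # Assumed: proximity falls off linearly to 0 at the edge of the window.
    if window_seconds > 0:
        proximity = 1.0 - clamp01(abs(seconds_from_deploy) / window_seconds)
    else:
        proximity = 0.0
    # Assumed: deviation is relative to the median and capped at 100%.
    if median_ms > 0:
        deviation = clamp01(abs(duration_ms - median_ms) / median_ms)
    else:
        deviation = 0.0
    service = 1.0 if includes_incident_service else 0.0

    score = (
        MAX_POINTS["error_density"] * clamp01(error_density)
        + MAX_POINTS["temporal_proximity"] * proximity
        + MAX_POINTS["latency_deviation"] * deviation
        + MAX_POINTS["service_relevance"] * service
    )
    return round(score, 1)
```

Under these assumptions, a trace with half of its spans in error, starting exactly at deploy time, running at twice the median duration, and touching the incident's service scores 20 + 25 + 20 + 15 = 80.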
Latency regression detection
BlameTrail compares latency percentiles (p50, p95, p99) for each operation before and after the suspect deploy. If a percentile increases beyond the configured threshold, the operation is flagged as a regression.
Regressions are classified by severity:
| Severity | Criteria |
|---|---|
| Minor | One percentile exceeds the threshold |
| Moderate | Two percentiles exceed the threshold |
| Severe | All three percentiles exceed the threshold, or any single percentile doubles |
The latency regression panel appears below the trace list when regressions are detected.
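The severity rules in the table can be expressed as a short function. This is a sketch under stated assumptions: the threshold is treated as a fractional increase over the pre-deploy value (20% here, chosen arbitrarily), "doubles" is read as the post-deploy value being at least twice the pre-deploy value, and `classify_regression` is a hypothetical name.

```python
from typing import Optional

PERCENTILES = ("p50", "p95", "p99")


def classify_regression(
    before: dict[str, float],
    after: dict[str, float],
    threshold: float = 0.20,  # assumed: fractional increase; configurable in BlameTrail
) -> Optional[str]:
    """Return "minor", "moderate", "severe", or None for one operation."""
    exceeded = 0
    doubled = False
    for name in PERCENTILES:
        baseline, current = before[name], after[name]
        if baseline <= 0:
            continue  # no usable baseline to compare against
        if (current - baseline) / baseline > threshold:
            exceeded += 1
        if current >= 2 * baseline:
            doubled = True

    if exceeded == len(PERCENTILES) or doubled:
        return "severe"
    if exceeded == 2:
        return "moderate"
    if exceeded == 1:
        return "minor"
    return None
```

For example, with pre-deploy latencies of 100, 200, and 300 ms for p50, p95, and p99, a post-deploy p99 of 400 ms alone is classified as minor in this sketch, while a post-deploy p50 of 210 ms alone is classified as severe because it more than doubles.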
Requirements
- At least one trace-capable provider connected and mapped to the incident's service
- A suspect deploy identified for the incident
- A BlameTrail Starter or Pro plan
Next steps
- Connecting Providers — Set up a trace provider
- Service Mappings — Map providers to services