Pipeline Monitoring Runbook
Who this is for: on-call / operators / anyone who wants to know "is the system OK right now / where is it stuck / how much is AI costing"
Three layers:
- Layer 1, Dashboard at a glance: answers 90% of questions
- Layer 2, CloudWatch Logs Insights queries: trace one call across Lambdas
- Layer 3, Sentry / grepping the code directly: forensics + bug debugging
Last updated: 2026-05-05 (PR #745 / #724: end-to-end SLO end_to_end_duration_ms)
TL;DR: a customer asks "what happened to my call?"
1. Open the dashboard (URL in §Quick links below)
   → "Pipeline Throughput" panel: overall health
   → "Per-Stage ProcessingDuration p95": which stage is slow
   → "SQS Queue Depth": is there a backlog
   Explains it → done (2 minutes)
   Does not → ↓
2. CloudWatch Logs Insights (URL in §Quick links)
   → run the "trace one call" query (see §Layer 2)
   → 4-stage timeline in 30 seconds
   Explains it → done (5 minutes)
   Does not → ↓
3. Sentry (URL in §Quick links)
   → find events by the telephonySessionId tag
   → read the stack trace + breadcrumbs
   Need to go deeper → read the Lambda CloudWatch log group directly
Quick links (open these every day)
| Tool | URL pattern | What to look at |
|---|---|---|
| CloudWatch Dashboard | https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=call-analytics-prod-monitoring | Overall health + 4-stage pipeline + AB test |
| CloudWatch Logs Insights | https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:logs-insights | Cross-Lambda trace + custom queries |
| CloudWatch Metrics console | https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#metricsV2: | Custom metric charts, AB test comparison |
| Sentry | https://retaintive.sentry.io | Error stack traces + transactions |
| Discord webhook channel | (team Discord channel) | Real-time alerts (triggered by logger.error) |
| Lark webhook channel | (team Lark group) | Same as Discord, Chinese-friendly |
Pipeline overview (required reading: memorize this before looking at the dashboard)
RC customer call comes in
↓ (RC webhook → another repo: ringcentralSubscriptionService)
SQS:transcribe-queue
↓ ←── 5 min delay, then triggers the Lambda
[Lambda 1] transcribe-processor (15 min timeout, 2 concurrency)
↓ writes the recording to S3 + Deepgram (primary) / AWS Transcribe (fallback)
↓ Neon: inserts 49 RC metadata fields into the calls table
SQS:ai-analysis-queue
↓
[Lambda 2] ai-analysis-processor (15 min, 5 concurrency)
↓ OneRouter (Grok 4.1 / DeepSeek V4 Flash AB test)
↓ DDB write + Neon: calls + contacts + timeline atomic batch
SQS:contacts-analyzer-queue (per-call FIFO)
↓
[Lambda 3] contacts-analyzer (15 min, 5 concurrency)
↓ AI analysis + Neon: tasks + contact_timeline updates
DONE: the task appears on the dashboard
In parallel (SMS path):
RC SMS webhook → SQS:message-queue → [Lambda 4] message-processor → Neon
All 4 stages emit a ProcessingDuration metric, 3 stages emit NeonWriteDuration, and ai-analysis + contacts-analyzer emit AIInvokeDuration.
Layer 1: Dashboard at a glance (answers 90% of questions)
Dashboard name: call-analytics-prod-monitoring (rendered in prod only).
Find the panel by what you want to know:
🔴 "Is the system down right now?" (look here first)
| Panel | Red flag |
|---|---|
| Pipeline Throughput (4-stage success overlay) | One line suddenly drops to 0 → that stage is down |
| Pipeline Failure Rate per Stage | Failure count spikes → check which stage |
| Lambda Errors per Function | Errors > 0 → that Lambda crashed |
| SQS Queue Depth | Messages > 50 → the queue is backed up |
| SQS Oldest Message Age | > 600 sec → severe backlog |
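The two SQS red flags in the table can be read as a simple threshold check. The following is a minimal sketch, not code from the repo: `classifyQueue` and `QueueStatus` are hypothetical names, and the two inputs stand for the values you read off the "SQS Queue Depth" and "SQS Oldest Message Age" panels.

```typescript
// Thresholds copied from the red-flag table above.
const QUEUE_DEPTH_RED_FLAG = 50; // messages
const OLDEST_AGE_RED_FLAG_SEC = 600; // seconds

type QueueStatus = "ok" | "backed_up" | "severe_backlog";

// Hypothetical helper: classifies one queue from its depth and oldest-message age.
// The age check runs first because the table treats it as the more severe signal.
function classifyQueue(visibleMessages: number, oldestAgeSec: number): QueueStatus {
  if (oldestAgeSec > OLDEST_AGE_RED_FLAG_SEC) return "severe_backlog";
  if (visibleMessages > QUEUE_DEPTH_RED_FLAG) return "backed_up";
  return "ok";
}
```

Both comparisons are strict, matching the `>` in the table, so a queue sitting at exactly 50 messages and 600 seconds still counts as "ok".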
🐌 "Is the system slow? Which stage?"
| Panel | Red flag |
|---|---|
| Per-Stage ProcessingDuration p95 (4-Lambda overlay) | A line suddenly climbs → that stage is slow |
| Stage Bottleneck Breakdown | Compare the four lines AIInvoke / NeonWrite / SqsQueueWait / AiToContactDelay → identify the bottleneck type |
| Lambda Duration p99 | Close to the timeout (transcribe 15 min, ai-analysis 15 min) = danger |
| Lambda Concurrent Executions p95 | Close to reservedConcurrency (transcribe=2, ai-analysis=5) → throttling |
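💰 "How much did AI cost today?"
| Panel | What to look at |
|---|---|
| AI Cost (USD per 5min) | Sum the datapoints = the day's cost |
| AI Cost per Model | Which is more expensive, DeepSeek or Grok |
| AI Tokens (input + output) | Token volume; an abnormal spike = a bug or a retry storm |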
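"Sum the datapoints" on the AI Cost panel means adding up the per-5-minute values over the day. A minimal sketch of that arithmetic follows; the `CostDatapoint` shape and `totalCostUsd` are hypothetical names for illustration, and the values are made up rather than taken from the dashboard.

```typescript
// One 5-minute bucket from the "AI Cost (USD per 5min)" panel (hypothetical shape).
interface CostDatapoint {
  bucketStart: string; // ISO 8601 start of the 5-minute bucket
  sumUsd: number; // cost recorded in that bucket
}

// The day's cost is the plain sum of all buckets in the day.
function totalCostUsd(datapoints: CostDatapoint[]): number {
  return datapoints.reduce((total, dp) => total + dp.sumUsd, 0);
}

// Illustrative values only.
const exampleBuckets: CostDatapoint[] = [
  { bucketStart: "2026-05-04T10:00:00Z", sumUsd: 0.25 },
  { bucketStart: "2026-05-04T10:05:00Z", sumUsd: 0.5 },
  { bucketStart: "2026-05-04T10:10:00Z", sumUsd: 0.125 },
];
const exampleTotalUsd = totalCostUsd(exampleBuckets);
```

A full day has up to 288 five-minute buckets, so real totals will accumulate small floating-point error; round to cents only when displaying.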
🧪 "How is the DeepSeek AB test going?" (PR #721)
| Panel | What to look at |
|---|---|
| AI Cost per Model | Two lines: Grok vs DeepSeek unit-cost comparison |
| AI Latency p95 per Model | Whether DeepSeek is slower |
| AI Quality — Zod Validation | DeepSeek output schema pass rate (green = pass) |
| AI Avg Retry Attempts per Model | Average retry count = an indirect quality signal |
📊 "Is business traffic normal?"
| Panel | What to look at |
|---|---|
| Pipeline Throughput | Historical baseline vs today |
| OneRouter Errors by Type | Breakdown by OutputTruncated / JsonParseError / Timeout / HttpError |
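⏱️ "How many minutes from webhook to task in total?" (End-to-end SLO, PR #745)
Business question: "What was last week's p95 time from the customer hanging up to seeing a task?"
Previously unanswerable: each Lambda emits its own ProcessingDuration, but summing stage durations misses queue wait time, and the OpenTelemetry / Datadog industry consensus is that stage durations are not additive. Now answerable: the end_to_end_duration_ms metric is emitted at the end of contacts-analyzer as Date.now() - webhookReceivedAt, which is true wall-clock time (including all queue waits + cold starts + clock skew).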
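To see why summed stage durations undercount, compare them with the wall-clock difference for one call. The sketch below uses invented timings and hypothetical names (`StageRun`, `sumOfStageDurationsMs`, `endToEndDurationMs`), and it assumes webhookReceivedAt is carried through the pipeline as epoch milliseconds, which the runbook does not state.

```typescript
// Illustrative numbers only, not measurements from the pipeline.
interface StageRun {
  stage: string;
  startedAtMs: number; // when the Lambda began processing
  finishedAtMs: number; // when the Lambda finished
}

// What you get by adding per-stage ProcessingDuration values together.
function sumOfStageDurationsMs(runs: StageRun[]): number {
  return runs.reduce((total, run) => total + (run.finishedAtMs - run.startedAtMs), 0);
}

// Wall-clock time from webhook receipt to "now" at the end of the last stage.
// Assumption: webhookReceivedAtMs is epoch milliseconds.
function endToEndDurationMs(webhookReceivedAtMs: number, nowMs: number): number {
  return nowMs - webhookReceivedAtMs;
}

const webhookReceivedAtMs = 0;
const runs: StageRun[] = [
  { stage: "transcribe-processor", startedAtMs: 2000, finishedAtMs: 5000 },
  { stage: "ai-analysis-processor", startedAtMs: 6000, finishedAtMs: 9000 },
  { stage: "contacts-analyzer", startedAtMs: 10000, finishedAtMs: 12000 },
];

const summedMs = sumOfStageDurationsMs(runs); // 3000 + 3000 + 2000
const wallClockMs = endToEndDurationMs(webhookReceivedAtMs, 12000);
const queueWaitMs = wallClockMs - summedMs; // time spent between stages
```

In this made-up call the stages sum to 8000 ms while the customer actually waited 12000 ms; the missing 4000 ms is the time the message sat in queues between Lambdas.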
| Panel | What to look at |
|---|---|
| Pipeline E2E Latency p50/p95/p99 (webhook → task) | p95 persistently > 30s = some stage in the chain is slow; a sudden spike = a Lambda is stuck / SQS is backed up |
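Coverage warning:
- ~95% Deepgram + ~3% voicemail = counted in the primary SLO metric
- ~2% AWS Transcribe fallback is not counted in the SLO (architecture constraint: the AWS service writes the transcript JSON itself, so we cannot hook in a metadata sidecar)
- Calls that are not covered carry an e2e_attribution: 'no_propagation_field' tag in the ai-analysis log; Layer 2 Query 8 can count the uncovered slice
Threshold guidance: do not page before there is a baseline. Set the alarm threshold only after 1 week in production, once the real p95 is known.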
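When reviewing the first week of data to pick a threshold, it helps to be clear about what p50/p95/p99 mean. The sketch below uses the nearest-rank definition on a list of end_to_end_duration_ms samples; `percentile` is a hypothetical helper, and CloudWatch computes its own percentile statistics, possibly with a different method, so treat this as an explanation rather than a reimplementation of the panel.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile of an empty sample set is undefined");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  const value = sorted[rank - 1];
  if (value === undefined) throw new Error("rank out of range");
  return value;
}

// Illustrative samples: 20 calls taking 1 s, 2 s, ..., 20 s end to end.
const exampleLatenciesMs = Array.from({ length: 20 }, (_, i) => (i + 1) * 1000);
const exampleP95Ms = percentile(exampleLatenciesMs, 95);
```

With these 20 samples the p95 is the 19th value, so one slow call out of twenty does not move it, while p99 lands on the slowest call.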
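Layer 2: CloudWatch Logs Insights (when the dashboard is not enough)
Core scenario: a customer gives you a phone number or a telephonySessionId and you want the full timeline.
Essential query: trace one call across the 4 Lambdas
In CloudWatch Logs Insights, select the 4 log groups: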
- /aws/lambda/call-analytics-prod-transcribe-processor-ts-us-east-1
- /aws/lambda/call-analytics-prod-ai-analysis-processor-ts-us-east-1
- /aws/lambda/call-analytics-prod-contacts-analyzer-us-east-1
- /aws/lambda/call-analytics-prod-message-processor-us-east-1
Query 1: full trace by telephonySessionId
fields @timestamp, @log, @message, telephonySessionId, rootCorrelationId, phone, storeId
| filter telephonySessionId = "session-abc123"
| sort @timestamp asc
→ The full 4-stage pipeline timeline in 30 seconds. The @log field tells you which Lambda's log each line came from.
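If you paste customer-supplied identifiers into these queries often, a small builder keeps the query text consistent. This is a sketch under assumptions: `buildTraceQuery` and `SAFE_ID` are hypothetical, the allowed-character pattern is a guess at what session ids, phone numbers and webhook UUIDs look like, and the field list is the one from Query 1 reused for all three lookups.

```typescript
type TraceField = "telephonySessionId" | "phone" | "rootCorrelationId";

// Assumed-safe characters for ids; anything else is rejected instead of escaped,
// so a pasted value cannot alter the structure of the query.
const SAFE_ID = /^[A-Za-z0-9_.:+-]+$/;

function buildTraceQuery(field: TraceField, value: string): string {
  if (!SAFE_ID.test(value)) {
    throw new Error(`refusing to build a query from unexpected value: ${value}`);
  }
  return [
    "fields @timestamp, @log, @message, telephonySessionId, rootCorrelationId, phone, storeId",
    `| filter ${field} = "${value}"`,
    "| sort @timestamp asc",
  ].join("\n");
}

const exampleQuery = buildTraceQuery("telephonySessionId", "session-abc123");
```

The resulting string can be pasted into the Logs Insights editor with the 4 log groups selected.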
Query 2: trace by customer phone (no telephonySessionId available)
fields @timestamp, @log, @message, telephonySessionId, phone
| filter phone = "+19142654371"
| sort @timestamp asc
Query 3: trace by webhook UUID (starting from the RC webhook)
fields @timestamp, @log, @message, rootCorrelationId, telephonySessionId
| filter rootCorrelationId = "webhook-uuid-xxx"
| sort @timestamp asc
rootCorrelationId = the RC webhook UUID; all 4 Lambdas carry it.
Query 4: all of today's errors
fields @timestamp, @log, @message, telephonySessionId, phone
| filter level = "ERROR"
| sort @timestamp desc
| limit 50
Query 5: slow AI calls (abnormal latency)
fields @timestamp, telephonySessionId, model, AIInvokeDuration
| filter AIInvokeDuration > 30000
| sort AIInvokeDuration desc
→ Which calls had an AI call > 30 seconds.
Query 6: contacts-analyzer received the SQS message but waited a long time before processing (AiToContactDelay)
fields @timestamp, telephonySessionId, AiToContactDelayMs
| filter AiToContactDelayMs > 60000
| sort AiToContactDelayMs desc
→ Which calls waited > 60 seconds between ai-analysis finishing and contacts-analyzer picking them up (a sign of SQS backlog).
Query 7: rebuild one call's full timeline across Lambdas (PR #745, pipeline.milestone wide event)
fields @timestamp, milestone, telephony_session_id, webhook_received_at, transcribe_path, end_to_end_duration_ms, e2e_attribution
| filter @message like /pipeline.milestone/
| filter telephony_session_id = "s-abc123..."
| sort @timestamp asc
→ Returns 3-4 milestones (1 per Lambda); read in time order, they show how long each stage took:
2026-05-04 10:00:05.123 transcribe_complete transcribe_path=deepgram webhook_received_at=...
2026-05-04 10:00:08.456 ai_analysis_complete webhook_received_at=...
2026-05-04 10:00:11.789 task_processing_complete end_to_end_duration_ms=11666
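The per-stage times are the gaps between consecutive milestone timestamps. Here is a minimal sketch of that subtraction using the three sample rows above; `Milestone`, `StageGap` and `stageGaps` are hypothetical names, and the timestamps are rewritten as ISO 8601 UTC strings on the assumption that the exported rows can be parsed that way.

```typescript
interface Milestone {
  timestamp: string; // ISO 8601, e.g. exported from the Logs Insights result
  milestone: string;
}

interface StageGap {
  from: string;
  to: string;
  gapMs: number;
}

// Sorts milestones by time and returns the elapsed time between neighbours.
function stageGaps(milestones: Milestone[]): StageGap[] {
  const sorted = [...milestones].sort(
    (a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp),
  );
  const gaps: StageGap[] = [];
  let prev: Milestone | undefined;
  for (const cur of sorted) {
    if (prev !== undefined) {
      gaps.push({
        from: prev.milestone,
        to: cur.milestone,
        gapMs: Date.parse(cur.timestamp) - Date.parse(prev.timestamp),
      });
    }
    prev = cur;
  }
  return gaps;
}

// The three sample rows shown above.
const exampleMilestones: Milestone[] = [
  { timestamp: "2026-05-04T10:00:05.123Z", milestone: "transcribe_complete" },
  { timestamp: "2026-05-04T10:00:08.456Z", milestone: "ai_analysis_complete" },
  { timestamp: "2026-05-04T10:00:11.789Z", milestone: "task_processing_complete" },
];
const exampleGaps = stageGaps(exampleMilestones);
```

For the sample rows both gaps come out at 3333 ms. A gap includes the queue wait before the later stage, not just its processing time, so a large gap points at either the queue or the Lambda.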
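Common uses:
- A customer asks "why is there still no task for this call after 30 minutes" → run the query → see which milestone is missing → that stage is stuck
- To verify that this call's e2e SLO is healthy → read end_to_end_duration_ms directly from the final milestone
- To count how many calls took the AWS Transcribe path (not covered by the SLO) → filter e2e_attribution = "no_propagation_field" | stats count()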
Query 8: count #724 SLO coverage (how many calls per day enter the primary SLO metric vs uncovered)
fields @timestamp, milestone, e2e_attribution
| filter @message like /pipeline.milestone/
| filter milestone in ["task_processing_complete", "task_processing_skipped", "ai_analysis_complete"]
| stats count(*) by milestone, e2e_attribution
→ Over 7 days:
- task_processing_complete + e2e_attribution=null count = calls that actually enter the primary SLO metric
- task_processing_skipped + e2e_attribution=no_task_generated count = contact-not-found, no task generated (rollout gap)
- ai_analysis_complete + e2e_attribution=no_propagation_field count = AWS Transcribe fallback, cannot be covered
If the covered ratio suddenly drops below 90%, transcribe-processor is not passing webhookReceivedAt through; the upstream _routed_at may be broken, or webhook-parser may have been changed incorrectly.
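The runbook does not spell out how the covered ratio is calculated, so the sketch below picks one plausible definition and should be read as an assumption: covered calls divided by covered plus AWS Transcribe fallback calls, with task_processing_skipped rows left out. `CoverageRow` and `coveredRatio` are hypothetical names for the rows returned by Query 8.

```typescript
// One row of the Query 8 result: stats count(*) by milestone, e2e_attribution.
interface CoverageRow {
  milestone: string;
  e2eAttribution: string | null; // null when the log line has no e2e_attribution
  count: number;
}

// Assumed definition: covered / (covered + uncovered fallback calls).
// Returns null when there is nothing to divide by.
function coveredRatio(rows: CoverageRow[]): number | null {
  let covered = 0;
  let uncovered = 0;
  for (const row of rows) {
    if (row.milestone === "task_processing_complete" && row.e2eAttribution === null) {
      covered += row.count;
    } else if (
      row.milestone === "ai_analysis_complete" &&
      row.e2eAttribution === "no_propagation_field"
    ) {
      uncovered += row.count;
    }
  }
  const total = covered + uncovered;
  return total === 0 ? null : covered / total;
}

// Made-up counts for illustration.
const exampleRows: CoverageRow[] = [
  { milestone: "task_processing_complete", e2eAttribution: null, count: 90 },
  { milestone: "task_processing_skipped", e2eAttribution: "no_task_generated", count: 5 },
  { milestone: "ai_analysis_complete", e2eAttribution: "no_propagation_field", count: 10 },
  { milestone: "ai_analysis_complete", e2eAttribution: null, count: 95 },
];
const exampleRatio = coveredRatio(exampleRows);
```

Rows for ai_analysis_complete without an attribution tag are ignored on purpose: those calls are counted again at task_processing_complete, and including both would double count them.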
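Tips
- Logs Insights queries can reach back at most 7 days (the default retention). To look further back, query an older log group (retention is configured differently)
- Complex query → save it as a "Saved query" (save button in the top right) and share it with the team
- Querying multiple log groups at once: multi-select under "Select log group(s)" on the left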
Layer 3: Sentry (forensics + stack trace)
When to use Sentry vs CloudWatch:
- Sentry: I have an Error.stack and want to see which line of code blew up + the breadcrumb history
- CloudWatch Logs: I want the timeline + cross-Lambda trace + custom fields
Opening Sentry
URL: https://retaintive.sentry.io
Finding events
- By telephonySessionId: type tags:telephonySessionId:session-abc123 into the search bar
- By phone: tags:phone:+19142654371
- By Lambda: filter project = ai-analysis-processor / transcribe-processor / etc.
- By error type: filter error.type:ZodError / error.type:OneRouterError
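What a single event tells you
- The full stack trace (which line of code threw)
- Breadcrumbs (the few log lines before the error)
- Lambda invocation context (memoryLimit, executionTime, awsRequestId)
- Other related events (all errors with the same telephonySessionId)
"How do I debug X right now?": scenario cheatsheet
Scenario A: a customer says "there is still no task for my call after 30 minutes"
1. Layer 1 dashboard:
   - SQS Queue Depth: is the queue backed up?
   - Pipeline Throughput: which stage's success count has stopped moving?
2. Layer 2 Logs Insights:
   - Query 1: trace the full chain with the sessionId the customer gave you
   - Find the last log entry to see which Lambda it stopped in
3. If the last log is from ai-analysis:
   - Query 5: is it an AI timeout / retry storm?
   - Check whether AIInvokeDuration is abnormal
4. If there are no logs at all (ai-analysis never received it):
   - Check the upstream transcribe-processor log
   - Or further upstream, ringcentralSubscriptionService (another repo)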
Scenario B: "Pipeline Failure Rate" spikes on the dashboard
1. Layer 1: which stage spiked?
   - Check the line colors on "Pipeline Failure Rate per Stage" (orange/red/purple/pink)
2. Layer 2: find the cause
   - Filter to the failure timeframe + that Lambda's log
   - Query 4 to find all ERROR logs from that time
3. Layer 3 Sentry:
   - filter project = the failing Lambda
   - Find the cluster of errors by timestamp
   - Same stack trace across them = same bug
Scenario C: "AI cost suddenly went up today"
1. Layer 1: dashboard "AI Cost per Model"
   - Which model went up, Grok or DeepSeek?
   - Did "AI Tokens" rise with it?
2. Layer 2 Logs Insights:
   - Look for cases with abnormally many retries
   - Query: filter model = "X" + retry > 1
   - Are frequent Zod validation failures causing the retries?
3. Check the "AI Avg Retry Attempts per Model" panel
   - More DeepSeek retries → its output fails the schema more often
   - = DeepSeek quality is worse in the AB test
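Scenario D: you receive a Discord alert "DLQ has messages"
⚠️ Known issue (as of 2026-05-04): the DLQ alarm in monitoring-stack.ts goes through SnsAction → the SNS topic has no real subscriber → the alarm is silent.
In practice you will not receive a Discord alert about the DLQ. The fix is tracked in issue #723 (alarm-dispatcher).
Temporary workaround: check the DLQ depth in the SQS console manually at the start of each workday, or check the dashboard "SQS Queue Depth" panel (you have to open the dashboard; it will not page you).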
Scenario E: "AI output JSON failed to parse" (Zod error)
1. Layer 2 Logs Insights:
   - Query: filter @message like /Zod validation failed/
   - Check the zodErrors field → which field failed to parse
2. Layer 3 Sentry:
   - filter error.type:ZodError
   - Read the stack + breadcrumbs to find a prompt change / model behavior change
3. Dashboard:
   - "AI Quality — Zod Validation Pass/Failure" for the trend
   - A sudden increase = a model problem or a prompt change
Scenario F: "Lambda timeout"
1. Dashboard "Lambda Duration p99": which Lambda is close to the timeout?
2. Logs Insights:
   - filter @duration > 800000 (> 13 min, close to the 15 min timeout)
   - See which call it was
3. If it is ai-analysis:
   - Most likely the AI call itself timed out (150s × 4 retries = 10 min)
   - Query 5 to check AIInvokeDuration
4. If it is contacts-analyzer:
   - In cron mode it may be processing many contacts
   - Check the batch size or tune reservedConcurrency
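The numbers in this scenario fit together as a simple time budget. The sketch below only restates the arithmetic from the steps above; the constant and function names are hypothetical, and it ignores any backoff between attempts because the runbook does not mention one.

```typescript
const LAMBDA_TIMEOUT_MS = 15 * 60 * 1000; // 15 min Lambda timeout
const AI_CALL_TIMEOUT_MS = 150 * 1000; // 150 s per AI attempt
const AI_MAX_ATTEMPTS = 4; // the "150s × 4" from the scenario
const NEAR_TIMEOUT_MS = 800000; // threshold used in the @duration filter

// Worst case: every attempt runs until its own timeout.
function worstCaseAiMs(perAttemptTimeoutMs: number, attempts: number): number {
  return perAttemptTimeoutMs * attempts;
}

// Mirrors `filter @duration > 800000`.
function isNearTimeout(durationMs: number): boolean {
  return durationMs > NEAR_TIMEOUT_MS;
}

const worstCaseMs = worstCaseAiMs(AI_CALL_TIMEOUT_MS, AI_MAX_ATTEMPTS); // 600000 ms = 10 min
const headroomMs = LAMBDA_TIMEOUT_MS - worstCaseMs; // 300000 ms = 5 min left for everything else
```

So an ai-analysis invocation that exhausts all AI attempts has used two thirds of its Lambda budget before any Neon or DDB write, which is why AIInvokeDuration is the first thing to check.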
Metric namespace quick reference
The CloudWatch Metrics namespace determines where you find a metric. The same metric name in different namespaces is not queried together.
| Namespace | Emitted by | Main metrics |
|---|---|---|
| CallAnalytics/RingCentral | transcribe-processor + reconciliation-worker | ProcessingDuration, ProcessingSuccess/Failure, SqsQueueWaitMs, NeonWriteDuration, FirstAttempt/RetryMessages, RateLimitErrors, TranscriptionDuration |
| CallAnalytics/AI | ai-analysis-processor | ProcessingDuration, AICostUsd, InputTokens, OutputTokens, AIInvokeDuration, NeonWriteDuration, OneRouterError, ZodValidationPass/Failure, AttemptCount |
| CallAnalytics/ContactsAnalyzer | contacts-analyzer (added in PR #720) | ProcessingDuration, ProcessingSuccess/Failure, AIInvokeDuration, NeonWriteDuration, AiToContactDelayMs, end_to_end_duration_ms (PR #745 / #724: snake_case, the first metric under the new naming convention) |
| CallAnalytics/MessageProcessor | message-processor (added in PR #720) | ProcessingDuration, ProcessingSuccess/Failure, NeonWriteDuration |
| CallAnalytics/ConfigManager | config-manager | (admin path, rarely used) |
| AWS/Lambda | Built into AWS | Errors, Throttles, ConcurrentExecutions, Duration, Invocations |
| AWS/SQS | Built into AWS | ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage |
Dimensions (for filtering):
- Most AI metrics have Model (Grok-4-1-fast-reasoning / deepseek-v4-flash) → AB test comparison
- Some AI metrics have AccountId, Franchise → multi-tenant grouping (currently mostly OrangeTheory)
- Lambda metrics have FunctionName
- SQS metrics have QueueName
"Where in the code are the metrics / logs implemented?"
If you want to know which line of code emits a number on the dashboard:
| Dashboard panel | Source code |
|---|---|
| AI Cost / Tokens | lambda/ai-analysis-processor/src/utils/metrics.ts emitAICostMetrics |
| AI ProcessingDuration | Same file, emitProcessingDurationMetrics |
| AI Quality (Zod / Retry) | Same file, emitZodValidationMetric / emitRetryCountMetric (new in PR #722) |
| AIInvokeDuration | lambda/ai-analysis-processor/src/infrastructure/onerouter/client.ts:148 emitAIInvokeDuration |
| NeonWriteDuration (ai) | lambda/ai-analysis-processor/src/infrastructure/neon-repository.ts:311 |
| Per-stage Success/Failure | src/utils/metrics.ts in each of the 4 Lambdas has emitProcessingMetrics |
| SqsQueueWaitMs (transcribe) | lambda/transcribe-processor/src/record-processor.ts:298 emitSqsQueueWaitMs |
| AiToContactDelayMs | lambda/contacts-analyzer/src/handler.ts:466 emitAiToContactDelayMs |
| OneRouterError | lambda/ai-analysis-processor/src/handler.ts:840 emitOneRouterErrorMetrics |
| Lambda Errors / Throttles / Duration | Built into AWS, no emit code |
| SQS Queue Depth / Age | Built into AWS, no emit code |
Dashboard widget definitions: lib/stacks/monitoring-stack.ts line 1633+
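Common confusion (pitfalls hit 3+ times)
"The dashboard has no data"
- Check 1: which env are you looking at? The dashboard is only created in prod (if (!isProduction) return; guard)
- Check 2: is the metric namespace right? A metric that is not in CallAnalytics/RingCentral will not show up in that panel
- Check 3: time range? The dashboard defaults to 3h; widen it to 12h or 24h
"Pre/test env has no metrics"
- Fixed since PR-2 #720: 3 Lambdas had their non-prod early return removed, so pre/test emit metrics too
- But pre/test have no dashboard (prod only). You can only look in the CloudWatch Metrics console directly
- Logs Insights is available in every env
"Can't find the trace for a telephonySessionId"
- transcribe-processor does not always log telephonySessionId (it only has it after parsing); the flow is webhookUuid → transcribe → sessionId obtained → downstream Lambdas
- Tracing by rootCorrelationId (= webhookUuid) is more reliable
"Discord/Lark alert never arrived"
- Only logger.error triggers an alert; logger.warn does not (by design)
- If it was triggered by a CloudWatch Alarm, it is currently silent (issue #719/#723) and will not page you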
Changelog (maintenance of this runbook)
| Date | Change | Notes |
|---|---|---|
| 2026-05-04 | Initial version, based on merged PRs #718 + #720 + #722 | 11 dashboard panels + 7 metric helpers + correlation across 4 Lambdas |
| 2026-05-05 | Added §"How many minutes from webhook to task in total?" + Query 7-8 + end_to_end_duration_ms metric (PR #745 / #724) | New dashboard panel Pipeline E2E Latency p50/p95/p99 + pipeline.milestone wide-event log across 3 Lambdas + Coverage ~98% (AWS Transcribe ~2% gap with attribution tag) |
Related docs
Other runbooks (same runbooks/ directory)
- lambda-concurrency-decisions.md: concurrency design decisions
- The AI analysis Lambda, TranscribeProcessor, and Reconciliation runbooks are not published yet (stale internal links pending cleanup); the GitHub sources are readable: ai-analysis.md / transcribe-processor.md / reconciliation.md
Source repo (callytics-infrastructure)
- Design spec: docs/superpowers/specs/2026-05-03-pipeline-observability-design.md
- Metric namespace audit: docs/observability/metric-namespace-audit-2026-05-03.md
- Naming convention (required reading before adding a metric / log): docs/observability-naming-convention.md (PR #745)
- Main PRs (merged):
  - #718: dashboard AI panels
  - #720: observability core (correlation + metrics)
  - #722: monitoring completeness (11 dashboard panels + this runbook)
  - #745: end-to-end SLO end_to_end_duration_ms + pipeline.milestone wide event + naming convention (closes #724 + #742)
Follow-up issues
- callytics-infra#723 alarm-dispatcher (closes the alert delivery loop)
- callytics-infra#724 end-to-end SLO timestamp
- callytics-infra#725 cross-repo logger unify