
Pipeline Monitoring Runbook

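Who this is for: on-call engineers, operators, and anyone who needs to answer "is the system OK right now?", "where is it stuck?", or "how much is AI costing us?".

Three layers:

  • Layer 1, Dashboard at a glance: answers about 90% of questions
  • Layer 2, CloudWatch Logs Insights queries: trace a single call across Lambdas
  • Layer 3, Sentry or grepping the code directly: forensics and bug debugging

Last updated: 2026-05-05 (PR #745 / #724: end-to-end SLO metric end_to_end_duration_ms)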


TL;DR: a customer asks "what happened to my call?"

1. Open the dashboard (URL in §Quick links)
   → "Pipeline Throughput" panel: overall health
   → "Per-Stage ProcessingDuration p95": which stage is slow
   → "SQS Queue Depth": is anything backing up

   Explained → done (2 min)
   Not explained → ↓

2. CloudWatch Logs Insights (URL in §Quick links)
   → run the "trace one call" query (see §Layer 2)
   → a 4-stage timeline in about 30 seconds

   Explained → done (5 min)
   Not explained → ↓

3. Sentry (URL in §Quick links)
   → find events by the telephonySessionId tag
   → read the stack trace and breadcrumbs

   Need to go deeper → read the Lambda's CloudWatch log group directly

Quick links (open these every day)

| Tool | URL pattern | What to look at |
| --- | --- | --- |
| CloudWatch Dashboard | https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=call-analytics-prod-monitoring | Overall health, the 4 pipeline stages, AB test |
| CloudWatch Logs Insights | https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:logs-insights | Cross-Lambda traces, custom queries |
| CloudWatch Metrics console | https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#metricsV2: | Custom metric charts, AB test comparison |
| Sentry | https://retaintive.sentry.io | Error stack traces, transactions |
| Discord webhook channel | (team Discord channel) | Real-time alerts (triggered by logger.error) |
| Lark webhook channel | (team Lark group) | Same as Discord, Chinese-friendly |

Pipeline overview (required reading: memorize this before looking at the dashboard)

RC customer call comes in
    ↓ (RC webhook → another repo: ringcentralSubscriptionService)
SQS:transcribe-queue
    ↓ ←── 5 min delay, then triggers the Lambda
[Lambda 1] transcribe-processor (15 min timeout, 2 concurrency)
    ↓ writes the recording to S3 + Deepgram (primary) / AWS Transcribe (fallback)
    ↓ Neon: inserts 49 RC metadata fields into the calls table
SQS:ai-analysis-queue
[Lambda 2] ai-analysis-processor (15 min, 5 concurrency)
    ↓ OneRouter (Grok 4.1 / DeepSeek V4 Flash AB test)
    ↓ DDB write + Neon: calls + contacts + timeline in one atomic batch
SQS:contacts-analyzer-queue (per-call FIFO)
[Lambda 3] contacts-analyzer (15 min, 5 concurrency)
    ↓ AI analysis + Neon: tasks + contact_timeline updates
DONE: the task appears on the dashboard

In parallel (SMS path):
RC SMS webhook → SQS:message-queue → [Lambda 4] message-processor → Neon

All 4 stages emit a ProcessingDuration metric, 3 of them emit NeonWriteDuration, and ai-analysis and contacts-analyzer emit AIInvokeDuration.


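Layer 1: Dashboard at a glance (answers 90% of questions)

Dashboard name: call-analytics-prod-monitoring (rendered only in prod).

Find the panel by what you need to know:

🔴 "Is the system down right now?" (look here first)

| Panel | Red flag |
| --- | --- |
| Pipeline Throughput (4-stage success overlay) | A line suddenly drops to 0 → that stage is down |
| Pipeline Failure Rate per Stage | Failure count spikes → check which stage |
| Lambda Errors per Function | Errors > 0 → that Lambda is crashing |
| SQS Queue Depth | Messages > 50 → the queue is backed up |
| SQS Oldest Message Age | > 600 sec → severe backlog |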
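
The two SQS red flags in the table can be written down as a small check, for example when scripting a quick health probe. This is only a sketch: the QueueSnapshot shape and the classifyQueue name are hypothetical, and only the 50-message and 600-second thresholds come from this runbook.

```typescript
// Hypothetical shape: one reading of the two AWS/SQS metrics behind the panels.
interface QueueSnapshot {
  visibleMessages: number;     // ApproximateNumberOfMessagesVisible
  oldestMessageAgeSec: number; // ApproximateAgeOfOldestMessage
}

type QueueHealth = "ok" | "backed_up" | "severe_backlog";

// Thresholds taken from the red-flag table.
const DEPTH_RED_FLAG = 50;
const AGE_RED_FLAG_SEC = 600;

function classifyQueue(s: QueueSnapshot): QueueHealth {
  // Age is checked first because it is the more serious signal.
  if (s.oldestMessageAgeSec > AGE_RED_FLAG_SEC) return "severe_backlog";
  if (s.visibleMessages > DEPTH_RED_FLAG) return "backed_up";
  return "ok";
}
```

Both comparisons are strict, matching the "> 50" and "> 600 sec" wording, so a queue sitting exactly at a threshold still reads as ok.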

🐌 "Is the system slow? Which stage?"

| Panel | Red flag |
| --- | --- |
| Per-Stage ProcessingDuration p95 (4-Lambda overlay) | A line suddenly climbs → that stage is slow |
| Stage Bottleneck Breakdown | Compare the four lines AIInvoke / NeonWrite / SqsQueueWait / AiToContactDelay → identifies the type of bottleneck |
| Lambda Duration p99 | Close to the timeout (transcribe 15 min, ai-analysis 15 min) = danger |
| Lambda Concurrent Executions p95 | Close to reservedConcurrency (transcribe=2, ai-analysis=5) → throttling |

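💰 "How much did AI cost today?"

| Panel | What it shows |
| --- | --- |
| AI Cost (USD per 5min) | Sum the datapoints = cost for the day |
| AI Cost per Model | Which is more expensive, DeepSeek or Grok |
| AI Tokens (input + output) | Token volume; an abnormal spike = a bug or a retry storm |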
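
"Sum the datapoints" can be done by hand in the console, or on exported values with a few lines of code. A minimal sketch, assuming the 5-minute datapoints have already been exported with ISO 8601 UTC timestamps; the CostPoint shape and dailyCostUsd name are hypothetical.

```typescript
// Hypothetical datapoint: one 5-minute bucket from the "AI Cost (USD per 5min)" panel.
interface CostPoint {
  timestamp: string; // ISO 8601 in UTC, e.g. "2026-05-04T10:05:00Z"
  usd: number;
}

// Sum every bucket that falls on the given UTC day (format YYYY-MM-DD).
function dailyCostUsd(points: CostPoint[], day: string): number {
  return points
    .filter((p) => p.timestamp.slice(0, 10) === day)
    .reduce((total, p) => total + p.usd, 0);
}
```

Note that the day boundary here is UTC; if "today" means local business hours, the filter would need to change accordingly.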

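🧪 "How is the DeepSeek AB test going?" (PR #721)

| Panel | What it shows |
| --- | --- |
| AI Cost per Model | Two lines: Grok vs DeepSeek unit cost |
| AI Latency p95 per Model | Whether DeepSeek is slower |
| AI Quality — Zod Validation | Schema pass rate of DeepSeek output (green = pass) |
| AI Avg Retry Attempts per Model | Average retry count = an indirect quality signal |

📊 "Is business traffic normal?"

| Panel | What it shows |
| --- | --- |
| Pipeline Throughput | Historical baseline vs today |
| OneRouter Errors by Type | Errors broken down into OutputTruncated / JsonParseError / Timeout / HttpError |

⏱️ "How many minutes from webhook to task?" (End-to-end SLO, PR #745)

Business question: "What was last week's p95 time from the customer hanging up to seeing the task?"

This used to be unanswerable. Each Lambda emits its own ProcessingDuration, but summing stage durations leaves out queue wait time; the industry consensus (OpenTelemetry, Datadog) is that per-stage durations are not additive.

Now it is answerable: contacts-analyzer emits the end_to_end_duration_ms metric at the end of the pipeline as Date.now() - webhookReceivedAt, a true wall-clock duration (it includes all queue waits, cold starts, and clock skew).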
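
To see why per-stage durations cannot simply be added, here is a small worked sketch. The StageTiming shape, the function names, and the timings are made up for illustration; only the formula Date.now() - webhookReceivedAt comes from this runbook.

```typescript
// Illustrative only: one stage's processing window, in ms since the webhook arrived.
interface StageTiming {
  stage: string;
  startedAtMs: number;  // when the Lambda began processing
  finishedAtMs: number; // when it finished
}

// Sum of per-stage processing durations: blind to time spent waiting in queues.
function sumOfStageDurationsMs(stages: StageTiming[]): number {
  return stages.reduce((sum, s) => sum + (s.finishedAtMs - s.startedAtMs), 0);
}

// Wall-clock duration measured once, at the end of the pipeline.
function endToEndDurationMs(webhookReceivedAtMs: number, nowMs: number): number {
  return nowMs - webhookReceivedAtMs;
}

const webhookReceivedAtMs = 0;
const stages: StageTiming[] = [
  { stage: "transcribe", startedAtMs: 2000, finishedAtMs: 5000 },
  { stage: "ai-analysis", startedAtMs: 6500, finishedAtMs: 8500 },
  { stage: "contacts-analyzer", startedAtMs: 10000, finishedAtMs: 11666 },
];

const stageSum = sumOfStageDurationsMs(stages);              // 3000 + 2000 + 1666 = 6666
const e2e = endToEndDurationMs(webhookReceivedAtMs, 11666);  // 11666
const queueWaitMs = e2e - stageSum;                          // 5000 ms the stage sum never sees
```

In this made-up call, almost half of the customer-visible latency is queue wait, which is exactly the part a sum of ProcessingDuration values cannot show.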

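| Panel | Red flag |
| --- | --- |
| Pipeline E2E Latency p50/p95/p99 (webhook → task) | p95 sustained > 30s = some stage in the chain is slow; a sudden spike = a Lambda is stuck or SQS is backed up |

Coverage caveats:

  • ~95% Deepgram + ~3% voicemail: counted in the primary SLO metric
  • ~2% AWS Transcribe fallback: not counted in the SLO (architectural constraint: the AWS service writes the transcript JSON itself, so we cannot hook in a metadata sidecar)
  • Calls that are not covered carry an e2e_attribution: 'no_propagation_field' tag in the ai-analysis log; Layer 2 Query 8 counts the uncovered slice

Alerting rule: do not page before there is a baseline. Set the alarm threshold only after the metric has been live for 1 week and you have seen the real p95.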
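
When the first week of data is in, a baseline p95 can be eyeballed from exported end_to_end_duration_ms values. The sketch below uses the nearest-rank definition of a percentile; CloudWatch's own percentile statistic may compute it differently, so treat the result as a sanity check rather than the number the alarm will see. The percentile helper is hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point drift in p / 100.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Made-up end_to_end_duration_ms samples for illustration.
const sampleDurationsMs = [12000, 9000, 31000, 11000, 10000];
const p50 = percentile(sampleDurationsMs, 50); // 11000
const p95 = percentile(sampleDurationsMs, 95); // 31000
```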


Layer 2: CloudWatch Logs Insights (when the dashboard is not enough)

Core scenario: a customer gives you a phone number or a telephonySessionId, and you want the full timeline.

Essential query: trace one call across the 4 Lambdas

In CloudWatch Logs Insights, select these 4 log groups:

  • /aws/lambda/call-analytics-prod-transcribe-processor-ts-us-east-1
  • /aws/lambda/call-analytics-prod-ai-analysis-processor-ts-us-east-1
  • /aws/lambda/call-analytics-prod-contacts-analyzer-us-east-1
  • /aws/lambda/call-analytics-prod-message-processor-us-east-1

Query 1: full trace by telephonySessionId

fields @timestamp, @log, @message, telephonySessionId, rootCorrelationId, phone, storeId
| filter telephonySessionId = "session-abc123"
| sort @timestamp asc

→ The full 4-stage pipeline timeline in about 30 seconds. The @log field tells you which Lambda each log line came from.
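
If this trace query gets run many times a day, generating the query text avoids hand-editing mistakes. A minimal sketch: buildTraceQuery is a hypothetical helper, not something in the repo, and it only renders the text of Query 1; running it is still done in the console or through whatever tooling the team prefers.

```typescript
// Hypothetical helper: renders Query 1 for a given telephonySessionId.
function buildTraceQuery(telephonySessionId: string): string {
  // Reject IDs that could break out of the quoted string instead of guessing at escaping rules.
  if (telephonySessionId.length === 0 || /["\\\n\r]/.test(telephonySessionId)) {
    throw new Error("telephonySessionId is empty or contains a quote, backslash, or newline");
  }
  return [
    "fields @timestamp, @log, @message, telephonySessionId, rootCorrelationId, phone, storeId",
    `| filter telephonySessionId = "${telephonySessionId}"`,
    "| sort @timestamp asc",
  ].join("\n");
}

const exampleQuery = buildTraceQuery("session-abc123");
```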

Query 2: trace by customer phone (when you have no telephonySessionId)

fields @timestamp, @log, @message, telephonySessionId, phone
| filter phone = "+19142654371"
| sort @timestamp asc

Query 3: trace by webhook UUID (starting from the RC webhook)

fields @timestamp, @log, @message, rootCorrelationId, telephonySessionId
| filter rootCorrelationId = "webhook-uuid-xxx"
| sort @timestamp asc

rootCorrelationId = the RC webhook UUID; all 4 Lambdas carry it.

Query 4: all of today's errors (with the time range set to today)

fields @timestamp, @log, @message, telephonySessionId, phone
| filter level = "ERROR"
| sort @timestamp desc
| limit 50

Query 5: slow AI calls (abnormal latency)

fields @timestamp, telephonySessionId, model, AIInvokeDuration
| filter AIInvokeDuration > 30000
| sort AIInvokeDuration desc

→ Shows which calls had an AI call longer than 30 seconds.

Query 6: contacts-analyzer picked up the SQS message long after ai-analysis finished (AiToContactDelay)

fields @timestamp, telephonySessionId, AiToContactDelayMs
| filter AiToContactDelayMs > 60000
| sort AiToContactDelayMs desc

→ Shows which calls waited more than 60 seconds between ai-analysis finishing and contacts-analyzer picking them up (a sign of SQS backlog).

Query 7: rebuild one call's full timeline across Lambdas (PR #745, pipeline.milestone wide event)

fields @timestamp, milestone, telephony_session_id, webhook_received_at, transcribe_path, end_to_end_duration_ms, e2e_attribution
| filter @message like /pipeline.milestone/
| filter telephony_session_id = "s-abc123..."
| sort @timestamp asc

→ Returns 3-4 milestones (one per Lambda); read in time order, they show how long each stage took:

2026-05-04 10:00:05.123  transcribe_complete       transcribe_path=deepgram   webhook_received_at=...
2026-05-04 10:00:08.456  ai_analysis_complete                                 webhook_received_at=...
2026-05-04 10:00:11.789  task_processing_complete  end_to_end_duration_ms=11666

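Common uses:

  • A customer asks "why is there still no task for this call after 30 minutes?" → run the query → see which milestone is missing → that stage is stuck
  • To verify whether this call met the e2e SLO → read end_to_end_duration_ms from the final milestone
  • To count how many calls took the AWS Transcribe path (not covered by the SLO): filter e2e_attribution = "no_propagation_field" | stats count()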
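
The first two uses above amount to "diff consecutive milestones" and "find the first milestone that is missing". A sketch of both, assuming the Query 7 rows have been exported and @timestamp parsed to epoch milliseconds; the MilestoneRow shape and both function names are hypothetical, and the expected-milestone list is read off the sample output above.

```typescript
// Hypothetical row shape: the fields of interest from a Query 7 result.
interface MilestoneRow {
  timestampMs: number; // @timestamp parsed to epoch ms
  milestone: string;
}

interface StageGap { from: string; to: string; gapMs: number; }

// Elapsed time between consecutive milestones, in chronological order.
function milestoneGaps(rows: MilestoneRow[]): StageGap[] {
  const sorted = rows.slice().sort((a, b) => a.timestampMs - b.timestampMs);
  const gaps: StageGap[] = [];
  for (let i = 1; i < sorted.length; i++) {
    gaps.push({
      from: sorted[i - 1].milestone,
      to: sorted[i].milestone,
      gapMs: sorted[i].timestampMs - sorted[i - 1].timestampMs,
    });
  }
  return gaps;
}

// Milestones a normal call emits, per the sample output in this section.
const EXPECTED_MILESTONES = ["transcribe_complete", "ai_analysis_complete", "task_processing_complete"];

// The first expected milestone that never showed up marks the stuck stage.
function firstMissingMilestone(rows: MilestoneRow[]): string | null {
  const seen = rows.map((r) => r.milestone);
  for (let i = 0; i < EXPECTED_MILESTONES.length; i++) {
    if (seen.indexOf(EXPECTED_MILESTONES[i]) === -1) return EXPECTED_MILESTONES[i];
  }
  return null;
}

// The three sample rows from this section (month index 4 = May).
const sampleRows: MilestoneRow[] = [
  { timestampMs: Date.UTC(2026, 4, 4, 10, 0, 5, 123), milestone: "transcribe_complete" },
  { timestampMs: Date.UTC(2026, 4, 4, 10, 0, 8, 456), milestone: "ai_analysis_complete" },
  { timestampMs: Date.UTC(2026, 4, 4, 10, 0, 11, 789), milestone: "task_processing_complete" },
];
const sampleGaps = milestoneGaps(sampleRows); // two gaps of 3333 ms each
```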

Query 8: count #724 SLO coverage (how many calls per day enter the primary SLO metric vs uncovered)

fields @timestamp, milestone, e2e_attribution
| filter @message like /pipeline.milestone/
| filter milestone in ["task_processing_complete", "task_processing_skipped", "ai_analysis_complete"]
| stats count(*) by milestone, e2e_attribution

→ Over a 7-day window:

  • task_processing_complete with e2e_attribution=null: calls that actually enter the primary SLO metric
  • task_processing_skipped with e2e_attribution=no_task_generated: contact not found, so no task was generated (rollout gap)
  • ai_analysis_complete with e2e_attribution=no_propagation_field: AWS Transcribe fallback, cannot be covered

If the covered ratio suddenly drops below 90%, webhookReceivedAt is not arriving from transcribe-processor; the upstream _routed_at may be broken, or webhook-parser may have regressed.
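
The runbook does not spell out the denominator of the covered ratio, so the sketch below makes one explicit assumption: covered calls divided by covered plus the two uncovered buckets listed above, ignoring any other rows (for example ai_analysis_complete rows without an attribution tag, which would otherwise double-count covered calls). The CoverageRow shape and coveredRatio name are hypothetical.

```typescript
// Hypothetical row shape for the output of Query 8.
interface CoverageRow {
  milestone: string;
  e2eAttribution: string | null; // null when the field is absent
  count: number;
}

// Assumed definition: covered / (covered + no_task_generated + no_propagation_field).
function coveredRatio(rows: CoverageRow[]): number | null {
  let covered = 0;
  let uncovered = 0;
  rows.forEach((row) => {
    if (row.milestone === "task_processing_complete" && row.e2eAttribution === null) {
      covered += row.count;
    } else if (row.milestone === "task_processing_skipped" && row.e2eAttribution === "no_task_generated") {
      uncovered += row.count;
    } else if (row.milestone === "ai_analysis_complete" && row.e2eAttribution === "no_propagation_field") {
      uncovered += row.count;
    }
    // Any other row is ignored so the same call is not counted twice.
  });
  const total = covered + uncovered;
  return total === 0 ? null : covered / total;
}

// Made-up counts for illustration.
const exampleRatio = coveredRatio([
  { milestone: "task_processing_complete", e2eAttribution: null, count: 950 },
  { milestone: "task_processing_skipped", e2eAttribution: "no_task_generated", count: 30 },
  { milestone: "ai_analysis_complete", e2eAttribution: "no_propagation_field", count: 20 },
  { milestone: "ai_analysis_complete", e2eAttribution: null, count: 980 },
]); // 950 / 1000
```

If the team settles on a different denominator, only the three branch conditions need to change.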

Tips

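  • Insights queries can look back at most 7 days (the default retention). For anything older, query an older log group (retention settings differ)
  • Complex query → save it as a "Saved query" (the save button in the top right) and share it with the team
  • To query several log groups at once: multi-select under "Select log group(s)" on the left

Layer 3: Sentry (forensics + stack traces)

When to use Sentry vs CloudWatch:

  • Sentry: I have an Error.stack and want to see which line of code blew up, plus the breadcrumb history
  • CloudWatch Logs: I want a timeline, a cross-Lambda trace, and custom fields

Opening Sentry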

URL: https://retaintive.sentry.io

Finding events

  • By telephonySessionId: type tags:telephonySessionId:session-abc123 in the search bar
  • By phone: tags:phone:+19142654371
  • By Lambda: filter project = ai-analysis-processor / transcribe-processor / etc.
  • By error type: filter error.type:ZodError / error.type:OneRouterError

What a single event tells you

  • The full stack trace (which line of code threw)
  • Breadcrumbs (the log lines just before the error)
  • Lambda invocation context (memoryLimit, executionTime, awsRequestId)
  • Related events (all errors with the same telephonySessionId)

"How do I debug X right now?": scenario cheatsheet

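Scenario A: a customer says "it has been 30 minutes and my call still has no task"

1. Layer 1 dashboard:
   - SQS Queue Depth: is a queue backed up?
   - Pipeline Throughput: which stage's success count has stopped moving?

2. Layer 2 Logs Insights:
   - Query 1: trace the whole chain with the sessionId the customer gave you
   - Find the last log entry to see which Lambda it stopped in

3. If the last log is from ai-analysis:
   - Query 5: is it an AI timeout or a retry storm?
   - Check whether AIInvokeDuration is abnormal

4. If there is no log at all (ai-analysis never received it):
   - Check the upstream transcribe-processor log
   - Or further upstream, ringcentralSubscriptionService (another repo)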

Scenario B: "Pipeline Failure Rate" spikes on the dashboard

1. Layer 1: which stage is spiking?
   - Check the line colors in "Pipeline Failure Rate per Stage" (orange/red/purple/pink)

2. Layer 2: find the cause
   - Filter to the failure time window and that Lambda's log
   - Query 4 lists every ERROR log from that window

3. Layer 3 Sentry:
   - Filter project = the failing Lambda
   - Look for a cluster of errors by timestamp
   - Same stack trace across the cluster = same bug

Scenario C: "AI cost suddenly went up today"

1. Layer 1: dashboard "AI Cost per Model"
   - Which model went up, Grok or DeepSeek?
   - Did "AI Tokens" rise at the same time?

2. Layer 2 Logs Insights:
   - Look for cases with abnormally many retries
   - Query: filter model = "X" + retry > 1
   - Are frequent Zod validation failures causing the retries?

3. Check the "AI Avg Retry Attempts per Model" panel
   - DeepSeek retries a lot → its output fails the schema more often
   - = DeepSeek quality is worse in the AB test

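Scenario D: you receive a Discord alert "DLQ has messages"

⚠️ Known issue (as of 2026-05-04): the DLQ alarm in monitoring-stack.ts goes through SnsAction → the SNS topic has no real subscriber → the alarm is silent

In practice you will not receive a Discord alert about the DLQ. The fix is tracked in issue #723 (alarm-dispatcher).

Interim workaround: check the DLQ depth in the SQS console by hand at the start of each workday, or check the "SQS Queue Depth" dashboard panel (you have to open the dashboard; it will not page you).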

Scenario E: "AI output JSON failed to parse" (Zod error)

1. Layer 2 Logs Insights:
   - Query: filter @message like /Zod validation failed/
   - Read the zodErrors field → which field failed to parse

2. Layer 3 Sentry:
   - filter error.type:ZodError
   - Read the stack and breadcrumbs to find a prompt change or a change in model behavior

3. Dashboard:
   - "AI Quality — Zod Validation Pass/Failure": check the trend
   - A sudden increase = a model problem or a prompt change

Scenario F: "Lambda timeout"

1. Dashboard "Lambda Duration p99": which Lambda is close to its timeout?
2. Logs Insights:
   - filter @duration > 800000 (> 13 min, close to the 15 min timeout)
   - See which call it was
3. If it is ai-analysis:
   - Most likely the AI call itself is timing out (150 s × 4 retries = 10 min)
   - Query 5 shows AIInvokeDuration
4. If it is contacts-analyzer:
   - In cron mode it may be processing many contacts
   - Check the batch size or adjust reservedConcurrency
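
The arithmetic in step 3 is worth keeping next to the timeout values, because it shows how little headroom is left. A sketch under stated assumptions: the "4 retry" figure is read here as 4 attempts that each run to the 150 s limit, and any backoff between attempts is ignored; the constant and function names are hypothetical.

```typescript
// Values from this scenario: 150 s per AI call, 4 attempts, 15 min Lambda timeout.
const AI_CALL_TIMEOUT_SEC = 150;
const MAX_ATTEMPTS = 4; // assumption: "4 retry" means 4 timed-out attempts in total
const LAMBDA_TIMEOUT_SEC = 15 * 60;

// Worst case: every attempt runs until its own timeout (backoff not included).
function worstCaseAiSeconds(perCallTimeoutSec: number, attempts: number): number {
  return perCallTimeoutSec * attempts;
}

const worstCaseSec = worstCaseAiSeconds(AI_CALL_TIMEOUT_SEC, MAX_ATTEMPTS); // 600 s = 10 min
const headroomSec = LAMBDA_TIMEOUT_SEC - worstCaseSec;                      // 300 s for everything else

// Same threshold as the Logs Insights filter; @duration is in milliseconds.
function isNearTimeout(durationMs: number, thresholdMs: number = 800000): boolean {
  return durationMs > thresholdMs;
}
```

With only 5 minutes of headroom, a slow Neon write or an extra backoff on top of four timed-out AI calls is enough to push ai-analysis over the 15 min limit.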

Metric namespace cheat sheet

The CloudWatch Metrics namespace determines where you find a metric. A metric with the same name in a different namespace is a separate metric and will not show up in the same search.

| Namespace | Emitted by | Main metrics |
| --- | --- | --- |
| CallAnalytics/RingCentral | transcribe-processor + reconciliation-worker | ProcessingDuration, ProcessingSuccess/Failure, SqsQueueWaitMs, NeonWriteDuration, FirstAttempt/RetryMessages, RateLimitErrors, TranscriptionDuration |
| CallAnalytics/AI | ai-analysis-processor | ProcessingDuration, AICostUsd, InputTokens, OutputTokens, AIInvokeDuration, NeonWriteDuration, OneRouterError, ZodValidationPass/Failure, AttemptCount |
| CallAnalytics/ContactsAnalyzer | contacts-analyzer (added in PR #720) | ProcessingDuration, ProcessingSuccess/Failure, AIInvokeDuration, NeonWriteDuration, AiToContactDelayMs, end_to_end_duration_ms (PR #745 / #724; snake_case, the first metric under the new naming convention) |
| CallAnalytics/MessageProcessor | message-processor (added in PR #720) | ProcessingDuration, ProcessingSuccess/Failure, NeonWriteDuration |
| CallAnalytics/ConfigManager | config-manager | (admin path, rarely used) |
| AWS/Lambda | Built into AWS | Errors, Throttles, ConcurrentExecutions, Duration, Invocations |
| AWS/SQS | Built into AWS | ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage |

Dimensions (what you can filter by):

  • Most AI metrics have Model (Grok-4-1-fast-reasoning / deepseek-v4-flash) → AB test comparison
  • Some AI metrics have AccountId, Franchise → multi-tenant grouping (currently mainly OrangeTheory)
  • Lambda metrics have FunctionName
  • SQS metrics have QueueName


"Where in the code is a metric / log implemented?"

To find which line of code emits a number on the dashboard:

| Dashboard panel | Source code |
| --- | --- |
| AI Cost / Tokens | lambda/ai-analysis-processor/src/utils/metrics.ts emitAICostMetrics |
| AI ProcessingDuration | Same file, emitProcessingDurationMetrics |
| AI Quality (Zod / Retry) | Same file, emitZodValidationMetric / emitRetryCountMetric (new in PR #722) |
| AIInvokeDuration | lambda/ai-analysis-processor/src/infrastructure/onerouter/client.ts:148 emitAIInvokeDuration |
| NeonWriteDuration (ai) | lambda/ai-analysis-processor/src/infrastructure/neon-repository.ts:311 |
| Per-stage Success/Failure | src/utils/metrics.ts in each of the 4 Lambdas has emitProcessingMetrics |
| SqsQueueWaitMs (transcribe) | lambda/transcribe-processor/src/record-processor.ts:298 emitSqsQueueWaitMs |
| AiToContactDelayMs | lambda/contacts-analyzer/src/handler.ts:466 emitAiToContactDelayMs |
| OneRouterError | lambda/ai-analysis-processor/src/handler.ts:840 emitOneRouterErrorMetrics |
| Lambda Errors / Throttles / Duration | Built into AWS, no emit code |
| SQS Queue Depth / Age | Built into AWS, no emit code |

Dashboard widget definitions: lib/stacks/monitoring-stack.ts, line 1633 onward


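Common confusions (pitfalls hit 3 or more times)

"The dashboard has no data"

  • Check 1: which env are you looking at? The dashboard is only created in prod (the if (!isProduction) return; guard)
  • Check 2: is the metric namespace right? A metric outside CallAnalytics/RingCentral will not show up in a panel that reads that namespace
  • Check 3: the time range? The dashboard defaults to 3h; widen it to 12h or 24h

"Pre/test env has no metrics"

  • Fixed since PR-2 #720: 3 Lambdas dropped their non-prod early return, so pre/test emit metrics too
  • But pre/test have no dashboard (prod only). You have to look in the CloudWatch Metrics console directly
  • Logs Insights works in every env

"I cannot find the trace for a telephonySessionId"

  • transcribe-processor does not always log telephonySessionId (it only has it after parsing); follow webhookUuid → transcribe → get the sessionId → later Lambdas
  • Tracing by rootCorrelationId (= webhookUuid) is more reliable

"I did not get the Discord/Lark alert"

  • Only logger.error triggers one; logger.warn does not (by design)
  • If it came from a CloudWatch Alarm, it is currently silent (issue #719/#723) and will not page you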

Changelog (maintenance of this runbook)

| Date | Change | Notes |
| --- | --- | --- |
| 2026-05-04 | Initial version, based on merged PRs #718 + #720 + #722 | 11 dashboard panels + 7 metric helpers + correlation across 4 Lambdas |
| 2026-05-05 | Added the end-to-end SLO section (webhook to task) + Query 7-8 + the end_to_end_duration_ms metric (PR #745 / #724) | New dashboard panel Pipeline E2E Latency p50/p95/p99 + pipeline.milestone wide-event log across 3 Lambdas + coverage ~98% (AWS Transcribe ~2% gap, with attribution tag) |

Related docs

Other runbooks (same runbooks/ directory)

Source repo (callytics-infrastructure)

Follow-up issues