协程泄漏排查与修复

# 15.协程泄漏排查与修复

卷三第十五篇——goroutine 是 Go 最廉价的并发单元，启动只需 ~2KB 栈。但也正因为它太廉价，程序员往往忘了它也需要"善后"——一个没关闭的 channel、一个没 Stop 的 Timer、一个没 Done 的 context、一个没 Close 的 HTTP Body——每一个都可能让 goroutine 永远活着。这是 Go 程序最隐蔽的内存泄漏形式。读完本篇，你能回答：为什么 pprof 的 goroutine profile 是排查泄漏的第一工具？leaktest 怎么在单元测试中捕获泄漏？五种典型泄漏场景的根因和修复模式分别是什么？关键词：goroutine 泄漏、channel 阻塞、timer 泄漏、pprof goroutine、leaktest。

# 目录介绍

1. 案例引入
2. 架构概览
- 2.1 goroutine 生命周期全景
- 2.2 泄漏的五种根因分类
3. Channel 阻塞泄漏
4. 无限循环泄漏
5. HTTP 与 Timer 资源泄漏
6. context 未取消泄漏
- 6.1 忘记 defer cancel
- 6.2 父 context 已取消但子 goroutine 未感知
7. pprof 诊断定位
8. leaktest 自动检测
9. 生产环境监控体系
10. 综合案例串讲

# 1. 案例引入

# 1.1 一段崩在哪

看一个 WebSocket 推送网关——它维护与客户端的长连接，每当后端有新的推送消息时，转发给对应的客户端。生产环境发布后，前两周一切正常，第三周开始监控告警：goroutine 数量从 200 一路涨到 50000，RSS 从 300MB 涨到 3.5GB，最终 K8s OOM Kill：

// ws_gateway.go —— WebSocket 推送网关
package main

import (
    "log"
    "net/http"
    "sync"
    "time"
    "github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{
    ReadBufferSize:  1024,
    WriteBufferSize: 1024,
}

// 每个 WebSocket 连接——启动两个 goroutine
func handleWS(w http.ResponseWriter, r *http.Request) {
    conn, err := upgrader.Upgrade(w, r, nil)
    if err != nil {
        log.Printf("升级失败: %v", err)
        return
    }

    // goroutine 1：读协程——从客户端读取心跳
    go func() {
        for {
            _, msg, err := conn.ReadMessage()
            if err != nil {
                log.Printf("读失败: %v", err)
                return // ← 客户端断开 → 这里返回
            }
            log.Printf("收到客户端消息: %s", msg)
        }
    }()

    // goroutine 2：写协程——向客户端推送消息
    go func() {
        ticker := time.NewTicker(30 * time.Second)
        defer ticker.Stop()

        for {
            select {
            case msg := <-pushCh:
                if err := conn.WriteMessage(websocket.TextMessage, msg); err != nil {
                    log.Printf("写失败: %v", err)
                    return
                }
            case <-ticker.C:
                // 发送心跳
                if err := conn.WriteMessage(websocket.PingMessage, nil); err != nil {
                    log.Printf("心跳失败: %v", err)
                    return
                }
            }
        }
    }()
}

var pushCh = make(chan []byte) // 全局推送通道

func main() {
    http.HandleFunc("/ws", handleWS)
    log.Fatal(http.ListenAndServe(":8080", nil))
}

现象：

发布后前两天——goroutine 数 200~300 稳定，RSS 300MB
一周后——goroutine 数慢慢涨到 5000，RSS 涨到 800MB
两周后——goroutine 数突破 20000，RSS 1.5GB
第三周——goroutine 数突破 50000，RSS 3.5GB，OOM Kill
关键线索：goroutine 的增长和活跃 WebSocket 连接数不成比例——活跃连接只有 500 个，但 goroutine 有 50000 个

第一反应——查 pprof goroutine：

$ curl http://localhost:6060/debug/pprof/goroutine?debug=2 | grep -c "conn.ReadMessage"
# 49724 个 goroutine 全部卡在 conn.ReadMessage 上

49724 个 goroutine 在等客户端发消息——但其中 49000+ 对应的客户端早就断开了。为什么 conn.ReadMessage() 没有返回 error？因为客户端是异常断开（网络闪断、App 崩溃、切后台被系统回收）——TCP 连接在客户端的操作系统层面已经没了，但服务端的 WebSocket 连接还没有收到 FIN/RST 包（取决于网络中间设备的行为）。

在没有设置 ReadDeadline 的情况下——conn.ReadMessage() 会永远阻塞——等待一个永远不会到来的消息。

# 1.2 顺藤摸到根因

追查过程分三步：

第一步：逐一排查每个可能泄漏的 goroutine 源头——从 pprof 的栈信息中反查代码：

$ curl http://localhost:6060/debug/pprof/goroutine?debug=2 | grep -B3 "ReadMessage"
goroutine 12345 [IO wait, 384 hours]:
internal/poll.runtime_pollWait(0x7f..., 0x72)
    /usr/local/go/src/runtime/netpoll.go:345
net.(*netFD).Read(0xc000..., ...)
gorilla/websocket.(*Conn).ReadMessage()
    /app/handler.go:31  ← 第 31 行——handleWS 内的读协程

每个泄漏的 goroutine 都在同一个位置：conn.ReadMessage()——没有 timeout。

第二步：验证泄漏速度——用 goroutine profile 的快照对比：

# T1 时刻
$ curl -s http://localhost:6060/debug/pprof/goroutine?debug=1 | grep "count="
# goroutine profile: total 50123

# T2 时刻（10 分钟后）
$ curl -s http://localhost:6060/debug/pprof/goroutine?debug=1 | grep "count="
# goroutine profile: total 50387
# → 10 分钟涨了 264 个——每个新连接的读协程没退出，老连接的读协程也没退出

第三步：排查有没有其他泄漏点——除了 ReadMessage，还发现：

pushCh <- msg 的写协程——如果 pushCh 的消费者 goroutine 已经退出（因为读协程退出了），写协程也会阻塞
ticker.C 的心跳 ticker——在 goroutine 退出时 defer ticker.Stop() 保证了清理——但前提是 goroutine 能退出

这个事故藏着 7 个原理点：

① goroutine 在什么情况下"永不退出"——泄漏的五种根因分类？          → 第 2 章
② 向无缓冲 channel 发送但无接收者——为什么 goroutine 会永久阻塞？   → 第 3.1
③ channel 未关闭导致 for-range 永不退出？                          → 第 3.3
④ select {} 空循环和 for {} 死循环的区别？                         → 第 4 章
⑤ time.After 在 select 中为什么是内存泄漏源？                       → 第 5.2
⑥ pprof goroutine profile 怎么看泄漏现场——三件套解读？              → 第 7 章
⑦ leaktest 怎么在单元测试阶段就挡住泄漏？                            → 第 8 章

# 1.3 我们要回答什么

这个 WebSocket 网关案例贯穿全篇。我们从 goroutine 的生命周期出发，逐一解剖五种典型泄漏场景的根因和修复模式——再用 pprof goroutine profile 定位泄漏源头，用 leaktest 在 CI 中自动拦截，最后建立生产环境的监控和自愈机制。

本篇路线：

架构总图 (第 2 章) ── goroutine 退出路径 + 五种泄漏分类
   ↓
Channel 阻塞 (第 3 章) ── send/recv/未关闭——三种子模式 + select+ctx 修复范式
   ↓
无限循环 (第 4 章) ── for-range / select {} + 退出信号修复
   ↓
HTTP+Timer (第 5 章) ── Body 未关闭 / time.After 陷阱 / Ticker 泄漏
   ↓
context 未取消 (第 6 章) ── defer cancel / 父 ctx 取消子不可见
   ↓
pprof 诊断 (第 7 章) ── goroutine profile 三件套解读
   ↓
leaktest 检测 (第 8 章) ── 原理 + 集成 + 忽略已知
   ↓
生产监控 (第 9 章) ── Prometheus / 告警 / 自愈
   ↓
综合案例 (第 10 章) ── 修复网关 + 设计哲学

📌 本篇定位：第 13-14 篇讲的是"怎么正确使用并发控制"——加权信号量和 errgroup 是防患于未然。本篇讲的是"出了问题怎么排查和修复"——goroutine 泄漏是并发控制的反面教材。理解了泄漏的根因，回头再看 errgroup 的 WithContext 取消传播和 SetLimit 限并发——它们的设计动机就更清楚了：每一个特性都在堵一个泄漏的洞。

# 2. 架构概览

# 2.1 goroutine 生命周期全景

一个 goroutine 从创建到退出的完整路径：

go func() { ... }
        │
        ▼
runtime.newproc1() → 分配 G 对象 + 初始化 2KB 栈
        │
        ▼
G 进入 P 的本地队列 → 等待调度
        │
        ▼
M 绑定 P → 开始执行 G 的 fn
        │
        ├── fn 正常 return → 隐式调用 runtime.Goexit()
        │      → G 状态: _Grunning → _Gdead
        │      → G 对象回收、栈归还 mcache.stackcache
        │
        ├── fn 中调用 runtime.Goexit()
        │      → 同上——但 defer 会执行
        │
        └── fn 进入永久阻塞状态 → 泄漏
               → G 状态: _Grunning → _Gwaiting
               → G 永远不会变成 _Gdead
               → 栈不会归还——可能在后续 GC STW 时缩容
               → 但 G 对象本身永远存活 → goroutine 泄漏

goroutine 退出的唯一方式——fn 函数 return（包括 panic 被 recover 后 return）。没有"外部强制终止 goroutine"的 API——Go 的哲学是"协作式退出"。

泄漏的本质——goroutine 的 fn 永远无法 return。原因只有一种：fn 卡在某个阻塞操作上，而这个操作永远不会被唤醒。

# 2.2 泄漏的五种根因分类

按阻塞操作的种类，分为五类：

goroutine 泄漏
│
├─ 1. Channel 阻塞泄漏
│     ├─ 1.1 发送阻塞: ch <- v（无接收者）
│     ├─ 1.2 接收阻塞: <-ch（无发送者）
│     └─ 1.3 for-range channel（未关闭）
│
├─ 2. 无限循环泄漏
│     ├─ 2.1 for-range channel（channel 不关闭 + 无人发送）
│     ├─ 2.2 select {}（无任何 case 可触发）
│     └─ 2.3 for {} 死循环（无退出条件）
│
├─ 3. HTTP 与 Timer 资源泄漏
│     ├─ 3.1 http.Response.Body 未 Close
│     ├─ 3.2 time.After 在 select 中（无 Stop 导致 Timer 不释放）
│     └─ 3.3 time.Ticker 未 Stop
│
├─ 4. context 未取消泄漏
│     ├─ 4.1 忘记 defer cancel()——context.WithTimeout/WithCancel 的 cancel 未调用
│     └─ 4.2 父 ctx 已取消但子 goroutine 未感知（未检查 ctx.Done()）
│
└─ 5. 其他阻塞操作泄漏
      ├─ 5.1 sync.Mutex/Cond 等锁——忘记 Unlock 或 Wait 无 Signal
      └─ 5.2 系统调用阻塞——如未设置 deadline 的网络 I/O（本篇 WebSocket 案例）

每种泄漏的共同特征：goroutine 内部存在一个不会被唤醒的阻塞点。修复模式是统一的——给阻塞操作加上退出路径（select + ctx.Done / deadline / close signal）。

# 3. Channel 阻塞泄漏

# 3.1 发送阻塞模式

场景：向一个无缓冲（或已满的带缓冲）channel 发送数据——但接收方 goroutine 已经退出或不存在：

// ❌ 泄漏示例：发送方 goroutine 永远阻塞
func sendOnClosedConsumer() {
    ch := make(chan int) // 无缓冲

    // 消费者 goroutine
    go func() {
        v := <-ch // 接收一个值后退出
        fmt.Println(v)
    }()

    // 生产者 goroutine
    go func() {
        ch <- 1  // ← 成功——消费者在等
        ch <- 2  // ← 永久阻塞——消费者已经退出了
        fmt.Println("永远不会执行到这里")
    }()

    time.Sleep(100 * time.Millisecond)
}

// pprof 显示：
// goroutine 3 [chan send]:
// main.sendOnClosedConsumer.func2()
//     /app/main.go:17 +0x...

根因：G1 向 ch 发送——runtime.chansend 检查 recvq（接收者等待队列）→ 空 → G1 把自己挂到 sendq（发送者等待队列）→ 调用 gopark——状态变为 _Gwaiting。之后永远不会有人来接收——G1 永远沉睡。

goroutine 视角：

G1 在 ch <- 2 时：_Grunning → runtime.chansend → gopark → _Gwaiting
G1 对应的 sudog 挂在 ch.sendq 上
G1 需要另一个 G 调用 <-ch 来唤醒——但这个"另一个 G"永远不会出现 → 泄漏

# 3.2 接收阻塞模式

场景：从一个 channel 接收——但发送方 goroutine 已经退出或不存在：

// ❌ 泄漏示例：接收方 goroutine 永远阻塞
func recvFromNoSender() {
    ch := make(chan int)

    // 消费者 goroutine——没有生产者
    go func() {
        v := <-ch       // ← 永久阻塞——没有人往 ch 发数据
        fmt.Println(v)
    }()

    time.Sleep(100 * time.Millisecond)
}

// pprof 显示：
// goroutine 3 [chan receive]:
// main.recvFromNoSender.func1()

这个模式常出现在"结果 channel"模式中——生产者 goroutine 因其他错误提前退出了，但消费者还在等：

// ❌ 结果 channel 模式——生产者提前退出 → 消费者泄漏
func fetchAndProcess() {
    resultCh := make(chan *Result)

    go func() {
        // 生产者：调用 API
        resp, err := callAPI()
        if err != nil {
            log.Printf("API 失败: %v", err)
            return // ← 提前退出——resultCh 没有数据
        }
        resultCh <- resp
    }()

    go func() {
        // 消费者：等待结果
        result := <-resultCh // ← 永久阻塞——生产者没发数据就退出了
        process(result)
    }()
}

# 3.3 channel 未关闭模式

场景：用 for v := range ch 遍历 channel——channel 不关闭，goroutine 永不退出：

// ❌ 泄漏示例：for-range channel 不关闭
func forRangeNeverClose() {
    ch := make(chan int)

    go func() {
        for v := range ch { // ← 永远阻塞——ch 永远不会被关闭
            fmt.Println(v)
        }
        fmt.Println("永远不会执行到这里")
    }()

    ch <- 1
    ch <- 2
    ch <- 3
    // ← 忘记 close(ch)

    time.Sleep(100 * time.Millisecond)
}

// pprof 显示：
// goroutine 3 [chan receive]:
// main.forRangeNeverClose.func1()

为什么 close(ch) 能唤醒 for-range——close(ch) 会唤醒所有在 recvq 上等待的 G → for-range 的下一次循环收到零值 + ok=false → 退出循环。如果从不 close——recvq 上的 G 永远不醒来。

# 3.4 修复范式：select + ctx.Done

Channel 泄漏的统一修复模式——在阻塞操作旁加一个退出路径：

// ✅ 修复范式：select + ctx.Done() + 结果 channel
func sendWithCtx(ctx context.Context, ch chan<- int, v int) error {
    select {
    case ch <- v:
        return nil
    case <-ctx.Done():
        return ctx.Err() // ← 超时或取消时安全退出
    }
}

func recvWithCtx(ctx context.Context, ch <-chan int) (int, error) {
    select {
    case v := <-ch:
        return v, nil
    case <-ctx.Done():
        return 0, ctx.Err() // ← 退出路径
    }
}

// ✅ 用 close signal 替代 for-range 的无限等待
func processWithClose(ctx context.Context, ch <-chan int) {
    for {
        select {
        case v, ok := <-ch:
            if !ok {
                return // channel 已关闭 → 安全退出
            }
            fmt.Println(v)
        case <-ctx.Done():
            return // context 取消 → 安全退出
        }
    }
}

回到 WebSocket 案例——给 ReadMessage 加上 deadline：

// ✅ 修复：设置 ReadDeadline——避免永久阻塞
go func() {
    for {
        // 每次读取前设置 60 秒超时
        conn.SetReadDeadline(time.Now().Add(60 * time.Second))

        _, msg, err := conn.ReadMessage()
        if err != nil {
            if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
                // 超时——发送 ping 检查连接是否还活着
                if err := conn.WriteMessage(websocket.PingMessage, nil); err != nil {
                    return // 连接真的死了 → 退出
                }
                continue
            }
            return // 非超时错误 → 退出
        }
        log.Printf("消息: %s", msg)
    }
}()

# 4. 无限循环泄漏

# 4.1 for-range channel 陷阱

第 3.3 节已经展示了 for-range 未关闭 channel 的泄漏。这里补充一个变体——"有数据但没有关闭信号"：

// ❌ 生产者还在发数据，但消费者永远等不到 close
func producerConsumerLeak() {
    ch := make(chan int, 100)

    // 消费者
    go func() {
        for v := range ch { // ← 当 ch 有数据时不会阻塞
            fmt.Println(v)  // 但当缓冲区空了 → 阻塞等待
        }                   // → 直到 close(ch) 才退出
    }()

    // 生产者——发送 10 次后退出
    go func() {
        for i := 0; i < 10; i++ {
            ch <- i
        }
        // close(ch) // ← 忘了！消费者会泄漏
    }()

    time.Sleep(time.Second)
}

正确的生产者退出模式——生产完必须 close：

// ✅ 生产者退出时 close channel——通知消费者"没更多数据了"
go func() {
    defer close(ch) // ← 关键
    for i := 0; i < 10; i++ {
        ch <- i
    }
}()

# 4.2 select {} 空循环

select {} 是 Go 中最极端的永久阻塞——比 ch <- v 更隐秘：

// ❌ select {} 空循环——goroutine 永不退出
go func() {
    // ... do some work ...
    select {} // ← 永久阻塞——所有 case 都是 nil channel
}()

这个模式常出现在代码重构遗留——原本有一个 case <- doneCh，但重构后把 doneCh 删了，留下了空的 select {}。

// ✅ 改成带超时的版本
go func() {
    // ... do some work ...
    <-ctx.Done() // ← 等待取消信号
}()

# 4.3 修复范式：退出信号

所有无限循环泄漏的修复模式都一样——注入一个可感知的退出信号：

// ✅ 范式：done channel 退出信号
func worker(done <-chan struct{}, tasks <-chan Task) {
    for {
        select {
        case task, ok := <-tasks:
            if !ok {
                return // tasks channel 关闭
            }
            process(task)
        case <-done:
            return // 外部通知退出
        }
    }
}

// ✅ 范式：context 退出信号（推荐——可以级联取消）
func workerCtx(ctx context.Context, tasks <-chan Task) {
    for {
        select {
        case task, ok := <-tasks:
            if !ok {
                return
            }
            process(task)
        case <-ctx.Done():
            return // context 取消
        }
    }
}

# 5. HTTP 与 Timer 资源泄漏

# 5.1 未关闭的 HTTP Response Body

这是 Go 最常见的泄漏源之一——HTTP Response 的 Body 如果不 Close，底层 TCP 连接不会被回收到连接池：

// ❌ 泄漏：resp.Body 未关闭
func fetchURL(url string) ([]byte, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    // ← 忘记 defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return nil, err
    }
    return body, nil
}

为什么这是 goroutine 泄漏——http.Get 内部会启动一个 goroutine 等待响应体被完全消费。如果 resp.Body 没有 Close：

http.Get 内部 goroutine 等待 Body 被完全读取
  → Body 有 1MB 未读数据
  → io.ReadAll(resp.Body) 可能只读了前 512B（因为业务逻辑提前 return）
  → 底层的 TCP 连接被标记为"不可复用"
  → 内部 goroutine 等待 Body.Close() 或 Body 完全被读取
  → 如果两者都没发生 → 内部 goroutine 泄漏 + TCP 连接泄漏

// ✅ 修复：defer resp.Body.Close()
func fetchURLSafe(url string) ([]byte, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close() // ← 必须

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return nil, err
    }
    return body, nil
}

即使不需要 Body，也要 Close + 丢弃：

// ✅ 不关心 Body 也要 Close——否则连接不回收
func checkHealth(url string) error {
    resp, err := http.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()          // ← Close 连接才能回收
    io.Copy(io.Discard, resp.Body)   // ← 丢弃 Body 内容——但 Close 前必须读完
    return nil
}

# 5.2 time.After 在 select 中的陷阱

time.After 返回的 <-chan Time 在到期之前不会被 GC 回收——即使 select 的其他 case 已经提前触发：

// ❌ 泄漏：time.After 在 select 中——Timer 不释放
func processWithTimeout(ch <-chan int) {
    for {
        select {
        case v := <-ch:
            fmt.Println(v)
        case <-time.After(3 * time.Second): // ← 每次循环创建一个新 Timer
            fmt.Println("timeout")
            // 如果 ch 的数据频率很高——这个 case 永远不会被选中
            // 但 time.After 的 Timer 要等 3 秒后才被 GC！
        }
    }
}

泄漏机制：

如果 ch 的数据频率是 1ms 一个——循环 1000 次 = 1000 个 time.After
每个 time.After 的 Timer 要等 3 秒后才过期——这 3 秒内 1000 个 Timer 都在内存中
每个 Timer 约 1KB → 1000 个 = 1MB——看起来不大，但在高频循环中会快速膨胀

// ✅ 修复：用 time.NewTimer + Stop ——复用同一个 Timer
func processWithTimeoutSafe(ch <-chan int) {
    timer := time.NewTimer(3 * time.Second)
    defer timer.Stop()

    for {
        timer.Reset(3 * time.Second) // ← 重置计时器——复用同一块内存
        select {
        case v := <-ch:
            fmt.Println(v)
        case <-timer.C:
            fmt.Println("timeout")
        }
    }
}

# 5.3 Ticker 未 Stop 的累积效应

time.NewTicker 会在底层创建一个定时器 goroutine——如果不 Stop，这个 goroutine 永不退出：

// ❌ 泄漏：Ticker 没有 Stop——内部 goroutine 泄漏
func periodicTask() {
    ticker := time.NewTicker(1 * time.Second)
    // ← 缺少 defer ticker.Stop()

    for range ticker.C {
        doWork()
        if someCondition {
            return // ← goroutine 退出——但 ticker 的内部 goroutine 还在跑
        }
    }
}

time.NewTicker 底层结构：

time.NewTicker(d):
  → 创建 runtimeTimer（堆上分配）
  → 交给 runtime.timer 模块管理
  → 内部有一个 goroutine（timerproc / timers）循环检查
  → 到时间后向 ticker.C channel 发送时间值

ticker.Stop():
  → 从 timer 堆中移除 runtimeTimer
  → 标记 ticker.C 不可用
  → runtimeTimer 被 GC 回收

累积效应：如果周期性创建 Ticker 但不 Stop——每次创建一个 runtimeTimer → timer 堆膨胀 → 内部 goroutine 遍历更大的堆 → CPU 增加。1000 个未 Stop 的 Ticker → timer 堆 1000 个元素 → 每次检查 O(n) → 性能退化。

# 6. context 未取消泄漏

# 6.1 忘记 defer cancel

context.WithTimeout / context.WithCancel 返回的 cancel 函数必须被调用——否则 context 内部维护的 goroutine 和资源不会被释放：

// ❌ 泄漏：忘记 defer cancel()——context 内部资源不释放
func handleRequest(r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
    // ← 忘记 defer cancel()

    result, err := doWork(ctx)
    if err != nil {
        return // ← cancel 没有被调用——context 的 timer 还在等 5 秒
    }
    process(result)
}

// ✅ 修复：defer cancel()——即使提前 return 也会释放
func handleRequestSafe(r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
    defer cancel() // ← 关键

    result, err := doWork(ctx)
    if err != nil {
        return
    }
    process(result)
}

忘记 cancel 的内部代价：

WithTimeout 内部创建了一个 time.AfterFunc（即一个 runtimeTimer）
如果 cancel() 没被调用 → runtimeTimer 会等到超时时间才被触发
高频调用中：每秒 1000 个请求 × 5 秒超时 = 5000 个活跃的 runtimeTimer → timer 堆膨胀

# 6.2 父 context 已取消但子 goroutine 未感知

疑惑：父 goroutine 的 context 超时了——子 goroutine 为什么没有自动退出？

论证：

Go 的 context 取消是"推"模式——ctx 取消时关闭 ctx.Done() channel。但子 goroutine 必须主动"拉"——通过 select 或检查 ctx.Err()。如果子 goroutine 的 fn 内部是一个阻塞操作且不接收 ctx，父 context 的取消对它无效。

// ❌ 父 ctx 取消——子 goroutine 不感知
func parentCancelsButChildNotAware(ctx context.Context) {
    go func() {
        // 子 goroutine——没有检查 ctx.Done()
        time.Sleep(10 * time.Minute) // ← ctx 取消后继续睡
        fmt.Println("父都死了，我还活着")
    }()

    // 父 goroutine——1 秒后 ctx 取消
    ctx2, cancel := context.WithTimeout(ctx, 1*time.Second)
    defer cancel()
    <-ctx2.Done()
    // ctx2 取消了——但子 goroutine 完全不知道
}

结论：context 取消不是"魔法"——它是协作式的。每个 goroutine 必须自己检查 ctx.Done()。不检查 → 泄漏。这也是第 14 篇 errgroup 中反复强调"fn 内部必须检查 ctx.Done()"的原因。

# 7. pprof 诊断定位

# 7.1 goroutine profile 三件套

pprof 是诊断 goroutine 泄漏的第一武器——三种输出模式各有用途：

# 模式 1：概览——goroutine 总数 + 分类计数
$ curl http://localhost:6060/debug/pprof/goroutine?debug=1
# goroutine profile: total 50123
# 48724 @ 0x... gorilla/websocket.(*Conn).ReadMessage ...
# 1200  @ 0x... net/http.(*conn).serve ...
# 199   @ 0x... runtime.gopark ...

# 模式 2：全量栈——每个 goroutine 的完整调用栈
$ curl http://localhost:6060/debug/pprof/goroutine?debug=2 > goroutine_full.txt

# 模式 3：交互分析——go tool pprof
$ go tool pprof http://localhost:6060/debug/pprof/goroutine
(pprof) top
Showing nodes accounting for 50123, 99.8% of 50200 total
      flat  flat%   sum%        cum   cum%
     48724 97.06% 97.06%      48724 97.06%  gorilla/websocket.(*Conn).ReadMessage
      1200  2.39% 99.45%       1200  2.39%  net/http.(*conn).serve

# 7.2 逐例分析泄漏现场

案例 1：pprof 识别 channel 发送阻塞泄漏：

$ curl -s .../debug/pprof/goroutine?debug=2 | grep -A5 "chan send"
goroutine 12345 [chan send, 384 hours]:
main.sendLoop.func1()
    /app/producer.go:42 +0x5f
# ↑ "[chan send]" 标签 → goroutine 在 ch <- v 上阻塞
# → 检查 42 行的 channel——接收方 goroutine 是否退出了？

案例 2：pprof 识别 channel 接收阻塞泄漏：

$ curl -s .../debug/pprof/goroutine?debug=2 | grep -A5 "chan receive"
goroutine 23456 [chan receive, 200 hours]:
main.recvLoop.func1()
    /app/consumer.go:23 +0x...
# ↑ "[chan receive]" 标签 → goroutine 在 <-ch 上阻塞
# → 检查 23 行的 channel——发送方 goroutine 是否退出了？

案例 3：pprof 识别 IO wait 泄漏：

goroutine 34567 [IO wait, 500 hours]:
internal/poll.runtime_pollWait(0x..., 0x72)
    /usr/local/go/src/runtime/netpoll.go:345
net.(*netFD).Read(...)
net.(*conn).Read(...)
net/http.(*persistConn).Read(...)
bufio.(*Reader).Read(...)
gorilla/websocket.(*Conn).ReadMessage()
    /app/handler.go:31
# ↑ "[IO wait]" → goroutine 在等网络数据——且等了 500 小时
# → 500 小时 ≈ 20 天——远超合理范围 → 泄漏

# 7.3 堆内存关联分析

goroutine 泄漏通常伴随堆内存泄漏——因为 goroutine 的闭包捕获的变量都逃逸到了堆。结合 heap profile 可以估算泄漏的内存总量：

$ go tool pprof http://localhost:6060/debug/pprof/heap
(pprof) top
# 如果看到大量的 bytes.Buffer / json.Encoder / http.Request 对象
# 且数量与泄漏的 goroutine 数量成正比 → 这就是泄漏 goroutine 关联的堆内存

# 关联分析：
# 48724 个泄漏 goroutine × 每个携带的闭包变量（~8KB）= ~380MB 堆内存泄漏

# 8. leaktest 自动检测

# 8.1 leaktest 原理

goleak（Uber 开源）是最常用的 goroutine 泄漏检测库——原理是对比测试前后的 goroutine 栈：

测试开始前 → goroutine 栈快照 A
  执行被测代码
测试结束后 → goroutine 栈快照 B

B - A = 测试期间新增但没有退出的 goroutine → 泄漏！

// 安装：go get go.uber.org/goleak

import (
    "testing"
    "go.uber.org/goleak"
)

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m)
}

func TestWorker(t *testing.T) {
    defer goleak.VerifyNone(t) // ← 测试结束时检查是否有泄漏

    // 执行被测代码——启动 worker goroutine
    ch := make(chan int)
    go func() {
        ch <- 42 // ← 如果没有接收方 → goroutine 泄漏
    }()
    // ← 忘记接收 ch 的值 → worker goroutine 永远不会退出
}
// → 测试 FAIL：发现 1 个泄漏的 goroutine（堆栈指向 ch <- 42）

# 8.2 集成到测试流程

func TestGracefulShutdown(t *testing.T) {
    defer goleak.VerifyNone(t)

    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    // 启动被测 goroutine
    go worker(ctx)

    // 模拟工作一段时间
    time.Sleep(100 * time.Millisecond)

    // 触发关闭
    cancel()

    // 等待 worker 退出
    time.Sleep(200 * time.Millisecond)

    // goleak 检查——所有 goroutine 应该都退出了
}

CI 集成——在 go test -race 基础上加上 goleak：

# Makefile 或 CI 配置
go test -race -count=1 ./...

# 8.3 忽略已知泄漏

有些 goroutine 是框架或 runtime 的持久 goroutine（如 net/http.Server、runtime/timer），需要排除：

func TestMain(m *testing.M) {
    opts := []goleak.Option{
        goleak.IgnoreTopFunction("net/http.(*Server).Serve"),         // HTTP server
        goleak.IgnoreTopFunction("internal/poll.runtime_pollWait"),    // 网络 poller
        goleak.IgnoreTopFunction("runtime.gopark"),                   // runtime 内部
    }
    goleak.VerifyTestMain(m, opts...)
}

最佳实践——在每个需要 goroutine 的单元测试中都加上 defer goleak.VerifyNone(t)。在日常开发中"每次跑测试就自动检测泄漏"——比在生产环境发现泄漏的代价低一万倍。

# 9. 生产环境监控体系

# 9.1 Prometheus 指标暴露

把 goroutine 数量暴露为 Prometheus 指标——结合 pprof 的运行时计数：

import (
    "net/http"
    "runtime"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    goroutineCount = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "go_goroutines_current",
        Help: "当前 goroutine 数量",
    })
)

func init() {
    prometheus.MustRegister(goroutineCount)
}

// 启动一个 goroutine 定期更新指标
func monitorGoroutines() {
    go func() {
        ticker := time.NewTicker(10 * time.Second)
        defer ticker.Stop()
        for range ticker.C {
            goroutineCount.Set(float64(runtime.NumGoroutine()))
        }
    }()
}

func main() {
    monitorGoroutines()
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":9090", nil)
}

# 9.2 告警阈值设定

指标	告警阈值	严重阈值	含义
`go_goroutines_current`	> 10000	> 50000	goroutine 泄漏——持续增长超过 30 分钟
`go_goroutines_current / 活跃连接数`	> 20	> 100	每个连接的平均 goroutine 数——超出正常范围说明有泄漏
`go_goroutines_current 增长率`	> 10/min	> 100/min	goroutine 增长速率——快速膨胀说明有大面积泄漏

Prometheus 告警规则：

groups:
- name: goroutine_leak
  rules:
  - alert: GoroutineLeakSuspected
    expr: go_goroutines_current > 10000
    for: 30m
    annotations:
      summary: "goroutine 泄漏——当前 {{ $value }} 个，已超过 30 分钟"

  - alert: GoroutineRapidGrowth
    expr: rate(go_goroutines_current[5m]) > 10
    for: 5m
    annotations:
      summary: "goroutine 快速膨胀——每分钟增长 {{ $value }} 个"

# 9.3 自愈机制设计

对于已知的泄漏模式，可以在代码中内置防御性 goroutine 退出：

// ✅ 防御性编程：保证 goroutine 有超时退出路径
func startWorkerWithGuard(ctx context.Context) {
    go func() {
        for {
            select {
            case task := <-taskCh:
                process(task)
            case <-ctx.Done():
                return // ← 外部取消 → 安全退出
            case <-time.After(5 * time.Minute):
                // 兜底保护：如果所有信号都失效 → 超时强制退出
                log.Printf("WARNING: worker 在 5 分钟内未收到任何信号——强制退出")
                return
            }
        }
    }()
}

K8s 层面的自愈——如果 goroutine 已经泄漏且内存持续增长，K8s 的 liveness probe + memory limit 会自动重启 Pod：

# K8s Deployment 配置
resources:
  limits:
    memory: "2Gi"        # ← 超过 2GB → OOM Kill → Pod 重启
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

但这种重启只是"治标"——重启后如果泄漏模式没变，goroutine 还会继续涨。根治靠代码修复，兜底靠 K8s 重启。

# 10. 综合案例串讲

# 10.1 案例真相揭晓

回到第 1 章 WebSocket 推送网关的七个疑问，逐条作答：

疑问	答案
① goroutine 在什么情况下"永不退出"——五种根因分类？	第 2 章：channel 阻塞 / 无限循环 / HTTP+Timer / context 未取消 / 其他阻塞——五类
② 向无缓冲 channel 发送但无接收者——为什么永久阻塞？	第 3.1：G 调用 gopark 挂到 ch.sendq——需要另一个 G 调用 <-ch 来唤醒——但这个 G 不存在
③ channel 未关闭导致 for-range 永不退出？	第 3.3：for-range 等待 close(ch) 来结束循环——不 close 就永远等
④ select {} 空循环和 for {} 的区别？	第 4.2：select {} 是永久阻塞（没有可触发的 case），for {} 是忙等（占满 CPU）
⑤ time.After 为什么是内存泄漏源？	第 5.2：每次调用创建一个新 Timer——Timer 到期前不会被 GC——高频循环中累积几千个 Timer
⑥ pprof 怎么定位泄漏？	第 7 章：debug=1 看分类计数 → debug=2 看具体栈 → 结合 heap profile 估算内存泄漏
⑦ leaktest 怎么在单元测试挡住泄漏？	第 8 章：测试前后快照对比——新增未退出的 goroutine → goleak.VerifyNone

案例完整根因链条：

WebSocket 连接处理 goroutine:
  读协程: conn.ReadMessage() — 未设置 ReadDeadline
    → 客户端异常断开（App 崩溃 / 网络闪断）
    → TCP 连接在客户端 OS 层面已死——但服务端感知不到（无 FIN/RST）
    → conn.ReadMessage() 永远阻塞在 runtime_pollWait 上
    → 读协程永远不退出 → 2KB 栈 + 闭包变量（~8KB/goroutine）永久占用

  写协程: select { msg := <-pushCh / <-ticker.C }
    → 读协程泄漏后——连接无人维护
    → 写协程在下次 write 时会失败 → return —— 正常退出
    → 所以写协程不泄漏——只有读协程泄漏

  后果:
    → 每个泄漏的读协程: 2KB 栈 + ~8KB 闭包变量 = ~10KB/goroutine
    → 49000 个 × 10KB ≈ 490MB 堆栈内存
    → 加上 goroutine profile 的元数据开销 → 实际 RSS 增长远超此数
    → 3 周累计 50000 个泄漏 goroutine → RSS 3.5GB → OOM Kill

修复方案：

// ✅ 修复：ReadDeadline + context 双重保护
func handleWSFixed(w http.ResponseWriter, r *http.Request) {
    conn, err := upgrader.Upgrade(w, r, nil)
    if err != nil {
        return
    }
    defer conn.Close()

    ctx, cancel := context.WithCancel(r.Context())
    defer cancel()

    // 读协程——有超时保护
    go func() {
        defer cancel() // 读协程退出 → cancel ctx → 通知写协程也退出
        for {
            // ✅ 每次读取前设置 60 秒超时
            conn.SetReadDeadline(time.Now().Add(60 * time.Second))

            _, msg, err := conn.ReadMessage()
            if err != nil {
                return // 超时或连接断开 → 安全退出
            }
            handleMessage(msg)
        }
    }()

    // 写协程——有 ctx 保护
    go func() {
        ticker := time.NewTicker(30 * time.Second)
        defer ticker.Stop()

        for {
            select {
            case msg := <-pushCh:
                if err := conn.WriteMessage(websocket.TextMessage, msg); err != nil {
                    return
                }
            case <-ticker.C:
                if err := conn.WriteMessage(websocket.PingMessage, nil); err != nil {
                    return
                }
            case <-ctx.Done(): // ✅ ctx 取消 → 安全退出
                return
            }
        }
    }()
}

# 10.2 一次 goroutine 泄漏的完整排查旅程

现象: goroutine 数量从 200 → 50000（3 周），RSS 300MB → 3.5GB
─────────────────────────────────────────────────────────
        │
        ├─ 第一步：确认泄漏
        │    $ curl .../debug/pprof/goroutine?debug=1
        │    → goroutine profile: total 50123
        │    → 正常基线是 200～500 → 确认泄漏
        │
        ├─ 第二步：定位泄漏源
        │    $ curl .../debug/pprof/goroutine?debug=2 | grep "handler.go:"
        │    → 49724 个 goroutine 卡在 handler.go:31
        │    → 第 31 行: conn.ReadMessage()
        │    → 泄漏源确认：未设置 ReadDeadline 的 WebSocket 读操作
        │
        ├─ 第三步：验证泄漏机制
        │    在测试环境复现：建立连接 → 杀掉客户端进程
        │    → 服务端 goroutine 数 +2（读写各一）
        │    → 读 goroutine 在 10 分钟后仍未退出
        │    → 写 goroutine 在下一次 write 时退出
        │    → 确认：只有读 goroutine 泄漏
        │
        ├─ 第四步：估算泄漏量
        │    每个读 goroutine：2KB 栈 + ~8KB 闭包变量 = ~10KB
        │    49724 × 10KB ≈ 497MB 直接泄漏
        │    加上 goroutine 元数据 + GC 开销 → 实际 RSS 远超此值
        │
        ├─ 第五步：修复
        │    conn.SetReadDeadline(time.Now().Add(60 * time.Second))
        │    + ctx 双重保护
        │
        ├─ 第六步：验证修复
        │    发布后 24 小时：goroutine 数稳定在 200～300
        │    RSS 稳定在 300MB
        │    goleak 单元测试通过
        │
        └─ 第七步：防线建设
             CI: goleak.VerifyNone 拦截新泄漏
             监控: Prometheus alert goroutine > 10000
             自愈: K8s memory limit 2GB → OOM Kill 兜底重启

# 10.3 设计哲学回扣

哲学 1：Go 没有"外部强制终止 goroutine"——协作式退出的根本原因

每一篇并发相关的文章都回到了这个哲学：Go 没有 killGoroutine() 或 Thread.stop()。原因有两层——(1) 强制终止无法执行 defer（资源泄漏：锁、文件、连接），(2) 强制终止可能在 GC/写屏障中间执行（破坏 runtime 内部状态）。所以 goroutine 泄漏的本质是"退出的协作链路断裂"——修复的方法永远是修复链路（加 timeout / ctx.Done / close signal），而不是强行终止。

哲学 2：goroutine 泄漏不是"内存泄漏"——是"生命周期的泄漏"

goroutine 泄漏时占用的是 2KB 栈 + 闭包变量。虽然量不大（每个 ~10KB），但它是一直在线的活跃内存——GC 不会回收它，P 可能还会调度它（浪费 CPU 时间片检查阻塞状态）。10000 个泄漏 goroutine 不只是 100MB 内存——还有 10000 个 G 对象在网络 poller 中注册、在 scheduler 的 runq 之间穿梭、在 GC STW 时被扫描。一个泄漏的 goroutine 是活的——它的内存和 CPU 开销也是活的。

哲学 3：pprof goroutine profile 比 heap profile 更早暴露泄漏

goroutine 泄漏的第一个信号是 goroutine 数量增长——这比 RSS 增长更早、更敏感。goroutine 数量从 200 涨到 5000 可能只花了 1 天（RSS 从 300MB 到 500MB——差距不明显），但从 5000 涨到 50000 会拉动 RSS 暴涨。监控 goroutine 数量是第一道防线——在 RSS 告警之前就能发现问题。

哲学 4：单元测试中的 leaktest 是最低成本的防护墙

在生产环境发现 goroutine 泄漏的成本 = 线上事故（OOM Kill、P99 延迟暴涨、用户投诉）。在 CI 中发现泄漏的成本 = 一行 defer goleak.VerifyNone(t)。把检测左移到单元测试——是新写并发代码的最佳实践。

# 10.4 速查表

五种泄漏模式速查：

泄漏模式	代码特征	pprof 标签	修复方式
Channel 发送阻塞	`ch <- v`	`[chan send]`	`select { case ch <- v: case <-ctx.Done(): }`
Channel 接收阻塞	`<-ch`	`[chan receive]`	`select { case v := <-ch: case <-ctx.Done(): }`
for-range channel 不关闭	`for v := range ch`	`[chan receive]`	生产者 defer close(ch) 或 select 退出
time.After 在 select 中	`case <-time.After(d)`	`[select]`	用 `time.NewTimer` + `Reset` 复用
Ticker 未 Stop	`time.NewTicker(d)`	timer goroutine 不退出	`defer ticker.Stop()`
HTTP Body 未关闭	`http.Get(url)`	`[IO wait]`	`defer resp.Body.Close()` + `io.Copy(io.Discard, resp.Body)`
context 未 cancel	`context.WithTimeout`	context timer goroutine	`defer cancel()`
select {} 空循环	`select {}`	`[select]`	改成 `<-ctx.Done()`
网络 I/O 无 deadline	`conn.Read()`	`[IO wait]`	`conn.SetReadDeadline(time.Now().Add(d))`

pprof 诊断三件套：

命令	用途
`curl .../debug/pprof/goroutine?debug=1`	goroutine 总数 + 分类计数——快速判断是否泄漏
`curl .../debug/pprof/goroutine?debug=2`	每个 goroutine 的完整调用栈——定位泄漏源
`go tool pprof .../debug/pprof/goroutine`	交互式分析——top/list/tree/peek

防御体系分层：

层级	工具	时机
开发阶段	`goleak.VerifyNone(t)`	每个单元测试
CI 阶段	`go test -race` + goleak	每次 MR
预发环境	pprof goroutine profile 对比	压力测试
生产监控	Prometheus `go_goroutines_current`	实时
生产自愈	K8s memory limit + liveness probe	兜底

下一篇：我们已经掌握了 goroutine 泄漏的五种模式、pprof 定位方法和 leaktest 防线，下一步进入 16.并发设计模式详解 (opens new window)——把 Pipeline、Fan-in/Fan-out、Worker Pool、Or-Done、Tee 等经典模式的实现剖开。

上次更新: 2026/06/28, 17:55:19

← errgroup并行控制并发设计模式详解→