如何debug一個正在運行的go進程

superops 發佈 2024-05-03T17:52:14.295317+00:00

可以kill進程,但是前提時kill的話保障能找到問題根源,否則現場丟失無法再次debug。我知道我的過程的問題大約在 23 小時 15 分鐘前開始,並且1390 minutes 與此幾乎完全一致!

背景

  • go進程內無集成pprof等debug工具包
  • go在生產環境運行存在死鎖
  • 可以kill進程,但是前提時kill的話保障能找到問題根源,否則現場丟失無法再次debug


關於此問題

通常可以通過更改程序代碼來進行調試。

這可以稱為檢測:添加調試檢測以幫助了解錯誤,然後再次運行有問題的操作。

檢測可以是「列印語句」,也可以是更優雅的方式,例如添加調試器斷點,甚至可以不加改動地構建您的代碼,但 要求編譯器添加調試符號。

但有時您遇到的問題可能很少發生,以至於您無法重建(並因此重新運行)二進位文件,而只能調試正在運行的進程。這篇文章是關於這種情況的,使用 Go。


選項 1:將調試器附加到正在運行的程序

您可以使用調試器(例如Delve)附加到現有進程。無需重新編譯或添加檢測。

假設我們進程的 PID 是4040133

$ sudo ./dlv attach 4040133
Type 'help' for list of commands.
(dlv) goroutines
... goroutines' state is dumped here ...

那很簡單!Delve 當然更強大:您可以設置斷點、觀察變量、逐步執行代碼等。

選項 2:當您可以看到進程的 stderr 時,使用堆棧跟蹤退出

Go 提供了這個開箱即用的好功能:當您向它發送 SIGQUIT 信號時,它會以堆棧轉儲退出。顯示所有 goroutine 的堆棧轉儲,因此您可以知道每個「線程」在接收時正在做什麼SIGQUIT

所以在實踐中,這個堆棧跟蹤對你來說真的很有價值。現在讓我們學習挖掘它。

4040133您可以通過運行(仍然假設我們的 PID 是)將轉儲寫入進程的 stderr :

$ kill -QUIT 4040133

在您運行程序的另一個終端(或者它寫入 stderr 的地方,也許 $ journalctl如果您的應用程式在 systemd 下運行)您會看到:

SIGQUIT: quit
PC=0x464ce1 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex()
        /usr/local/go/src/runtime/sys_linux_amd64.s:552 +0x21
runtime.futexsleep(0x7fff3a356560?, 0x441df3?, 0xc000032000?)
        /usr/local/go/src/runtime/os_linux.go:56 +0x36
runtime.notesleep(0xbf91d0)
        /usr/local/go/src/runtime/lock_futex.go:159 +0x87
runtime.mPark()
        /usr/local/go/src/runtime/proc.go:1447 +0x2a
runtime.stoplockedm()
        /usr/local/go/src/runtime/proc.go:2611 +0x65
runtime.schedule()
        /usr/local/go/src/runtime/proc.go:3308 +0x3d
runtime.park_m(0xc0001251e0?)
        /usr/local/go/src/runtime/proc.go:3525 +0x14d
runtime.mcall()
        /usr/local/go/src/runtime/asm_amd64.s:425 +0x43

goroutine 1 [chan receive, 21508 minutes]:
github.com/function61/gokit/sync/taskrunner.(*Runner).Wait(...)
        /go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/sync/taskrunner/taskrunner.go:79
github.com/joonas-fi/joonas-sys/pkg/statusbar.logic({0x96f6f8, 0xc000030cc0})
        /workspace/pkg/statusbar/bar.go:150 +0x1bf
github.com/joonas-fi/joonas-sys/pkg/statusbar.Entrypoint.func1(0xc0001bc780?, {0xc28ab8?, 0x0?, 0x0?})
        /workspace/pkg/statusbar/bar.go:34 +0x25
github.com/spf13/cobra.(*Command).execute(0xc0001bc780, {0xc28ab8, 0x0, 0x0})
        /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0xc000187b80)
        /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
        /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:902
main.main()
        /workspace/cmd/jsys/main.go:42 +0x434

goroutine 17 [syscall, 21508 minutes]:
os/signal.signal_recv()
        /usr/local/go/src/runtime/sigqueue.go:168 +0x98
os/signal.loop()
        /usr/local/go/src/os/signal/signal_unix.go:23 +0x19
created by os/signal.Notify.func1.1
        /usr/local/go/src/os/signal/signal.go:151 +0x2a

goroutine 18 [chan receive, 21508 minutes]:
github.com/function61/gokit/os/osutil.CancelOnInterruptOrTerminate.func1()
        /go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/os/osutil/canceloninterruptorterminate.go:32 +0x4d
created by github.com/function61/gokit/os/osutil.CancelOnInterruptOrTerminate
        /go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/os/osutil/canceloninterruptorterminate.go:31 +0x10a

goroutine 19 [syscall, 4080 minutes]:
syscall.Syscall(0x0, 0x0, 0xc0000ea3e4, 0xc1c)
        /usr/local/go/src/syscall/asm_linux_amd64.s:20 +0x5
syscall.read(0xc000072060?, {0xc0000ea3e4?, 0x9?, 0xc0002e2ea0?})
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:696 +0x4d
syscall.Read(...)
        /usr/local/go/src/syscall/syscall_unix.go:188
internal/poll.ignoringEINTRIO(...)
        /usr/local/go/src/internal/poll/fd_unix.go:794
internal/poll.(*FD).Read(0xc000072060?, {0xc0000ea3e4?, 0xc1c?, 0xc1c?})
        /usr/local/go/src/internal/poll/fd_unix.go:163 +0x285
os.(*File).read(...)
        /usr/local/go/src/os/file_posix.go:31
os.(*File).Read(0xc00000e010, {0xc0000ea3e4?, 0x1?, 0x120?})
        /usr/local/go/src/os/file.go:119 +0x5e
bufio.(*Scanner).Scan(0xc0000e3ef8)
        /usr/local/go/src/bufio/scan.go:215 +0x865
github.com/joonas-fi/joonas-sys/pkg/statusbar.logic.func1({0x0?, 0x0?})
        /workspace/pkg/statusbar/bar.go:61 +0x89
github.com/function61/gokit/sync/taskrunner.(*Runner).Start.func1()
        /go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/sync/taskrunner/taskrunner.go:51 +0x45
created by github.com/function61/gokit/sync/taskrunner.(*Runner).Start
        /go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/sync/taskrunner/taskrunner.go:50 +0x105

goroutine 23 [chan receive, 1390 minutes]:
github.com/function61/gokit/sync/taskrunner.(*Runner).waitInternal.func2(...)
        /go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/sync/taskrunner/taskrunner.go:101
github.com/function61/gokit/sync/taskrunner.(*Runner).waitInternal(0xc0000b2140)
        /go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/sync/taskrunner/taskrunner.go:134 +0x30a
github.com/function61/gokit/sync/taskrunner.(*Runner).Done.func1.1()
        /go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/sync/taskrunner/taskrunner.go:63 +0x25
created by github.com/function61/gokit/sync/taskrunner.(*Runner).Done.func1
        /go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/sync/taskrunner/taskrunner.go:62 +0x5a

goroutine 34 [chan send, 1390 minutes]:
github.com/vishvananda/netlink.routeSubscribeAt.func2()
        /go/pkg/mod/github.com/vishvananda/netlink@v1.1.0/route_linux.go:1075 +0x453
created by github.com/vishvananda/netlink.routeSubscribeAt
        /go/pkg/mod/github.com/vishvananda/netlink@v1.1.0/route_linux.go:1037 +0x2f2

rax    0xca
rbx    0x0
rcx    0x464ce3
rdx    0x0
rdi    0xbf91d0
rsi    0x80
rbp    0x7fff3a356530
rsp    0x7fff3a3564e8
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x43c400
r13    0x0
r14    0xbf8940
r15    0x7fb0d47ba96c
rip    0x464ce1
rflags 0x286
cs     0x33
fs     0x0
gs     0x0

關於轉儲的一件被低估的事情是它顯示了系統調用等待事件的時間!我知道我的過程的問題大約在 23 小時 15 分鐘前開始,並且1390 minutes 與此幾乎完全一致!

通過上面的堆棧轉儲,我能夠找出錯誤所在。

選項 3:當您看不到進程的 stderr 時,使用堆棧跟蹤退出

如果您不確定該過程的stderr去向,我建議您先看看是否有一個簡單的解決方案。假設您的進程 ID 是4040133. 查找文件描述符 #2(它始終是 stderr)以了解其stderr連接位置:

$ ls -al /proc/4040133/fd/2
l-wx------ 1 joonas joonas 64 Feb 20 19:37 /proc/4040133/fd/2 -> /home/joonas/.xsession-errors

在我的例子中,我的程序在 X.org 伺服器下運行,並且stderr簡單地寫入了我的 .xsession-errors文件。如果我早點意識到這一點,我就可以省去麻煩了。

由於當時我不確定stderr要寫入何處,所以我選擇了核選項。

(即使在您認為stderr沒有價值並將其重定向到/dev/null!!! 的情況下,這也有效!)

訣竅是針對$ strace您現有的流程並捕獲write(2, ...)系統調用。syscall的第一個參數write()文件描述符編號2再次表示stderr

所以附加到這個過程strace

$ sudo strace -p 2143770 -s 512 -ewrite 2> /tmp/strace.log

然後在另一個終端中,要求您的進程退出(這將觸發 Go 運行時寫入堆棧跟蹤,這通常會通過寫入代理結束丟棄/dev/null):

$ kill -QUIT 2143770

該進程現在將堆棧跟蹤轉儲到/dev/null,但它必須通過發出系統調用來完成,系統調用 strace會為您記錄。

當您查看日誌文件時/tmp/strace.log,它看起來像:

strace: Process 4040133 attached
--- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=2140770, si_uid=1000} ---
write(2, "SIGQUIT: quit", 13)           = 13
write(2, "\n", 1)                       = 1
write(2, "PC=", 3)                      = 3
write(2, "0x464ce1", 8)                 = 8
write(2, " m=", 3)                      = 3
write(2, "0", 1)                        = 1
write(2, " sigcode=", 9)                = 9
write(2, "0", 1)                        = 1
write(2, "\n", 1)                       = 1
write(2, "\n", 1)                       = 1
write(2, "goroutine ", 10)              = 10
write(2, "0", 1)                        = 1
write(2, " [", 2)                       = 2
write(2, "idle", 4)                     = 4
write(2, "]:\n", 3)                     = 3
write(2, "runtime.futex", 13)           = 13
write(2, "(", 1)                        = 1
write(2, ")\n", 2)                      = 2
write(2, "\t", 1)                       = 1
write(2, "/usr/local/go/src/runtime/sys_linux_amd64.s", 43) = 43
write(2, ":", 1)                        = 1
write(2, "552", 3)                      = 3
write(2, " +", 2)                       = 2
write(2, "0x21", 4)                     = 4
write(2, "\n", 1)                       = 1
write(2, "runtime.futexsleep", 18)      = 18
write(2, "(", 1)                        = 1
write(2, "0x7ffc12443b70", 14)          = 14
write(2, "?", 1)                        = 1
write(2, ", ", 2)                       = 2
write(2, "0x441df3", 8)                 = 8
write(2, "?", 1)                        = 1
write(2, ", ", 2)                       = 2
write(2, "0xc000036500", 12)            = 12
write(2, "?", 1)                        = 1
write(2, ")\n", 2)                      = 2
write(2, "\t", 1)                       = 1
write(2, "/usr/local/go/src/runtime/os_linux.go", 37) = 37
write(2, ":", 1)                        = 1
write(2, "56", 2)                       = 2
write(2, " +", 2)                       = 2
write(2, "0x36", 4)                     = 4
write(2, "\n", 1)                       = 1
write(2, "runtime.notesleep", 17)       = 17
write(2, "(", 1)                        = 1
write(2, "0xbfd370", 8)                 = 8
write(2, ")\n", 2)                      = 2
write(2, "\t", 1)                       = 1
... output snipped ...
write(2, "0xbfcae0", 8)                 = 8
write(2, "\n", 1)                       = 1
write(2, "r15    ", 7)                  = 7
write(2, "0x7ff9ba37ce03", 14)          = 14
write(2, "\n", 1)                       = 1
write(2, "rip    ", 7)                  = 7
write(2, "0x464ce1", 8)                 = 8
write(2, "\n", 1)                       = 1
write(2, "rflags ", 7)                  = 7
write(2, "0x286", 5)                    = 5
write(2, "\n", 1)                       = 1
write(2, "cs     ", 7)                  = 7
write(2, "0x33", 4)                     = 4
write(2, "\n", 1)                       = 1
write(2, "fs     ", 7)                  = 7
write(2, "0x0", 3)                      = 3
write(2, "\n", 1)                       = 1
write(2, "gs     ", 7)                  = 7
write(2, "0x0", 3)                      = 3
write(2, "\n", 1)                       = 1
+++ exited with 2 +++

這些是原始系統調用,因此您需要進行一些文本處理才能將其轉換回人類可讀的內容。

像這樣的腳本可能會對您有所幫助。

但基本思想是這樣的,讓我們先看前幾​行:

write(2, "SIGQUIT: quit", 13)           = 13
write(2, "\n", 1)                       = 1
write(2, "PC=", 3)                      = 3
write(2, "0x464ce1", 8)                 = 8
write(2, " m=", 3)                      = 3
write(2, "0", 1)                        = 1
write(2, " sigcode=", 9)                = 9
write(2, "0", 1)                        = 1
write(2, "\n", 1)                       = 1
write(2, "\n", 1)                       = 1
write(2, "goroutine ", 10)              = 10
write(2, "0", 1)                        = 1
write(2, " [", 2)                       = 2
write(2, "idle", 4)                     = 4
write(2, "]:\n", 3)                     = 3
write(2, "runtime.futex", 13)           = 13
write(2, "(", 1)                        = 1
write(2, ")\n", 2)                      = 2
write(2, "\t", 1)                       = 1
write(2, "/usr/local/go/src/runtime/sys_linux_amd64.s", 43) = 43
write(2, ":", 1)                        = 1
write(2, "552", 3)                      = 3
write(2, " +", 2)                       = 2
write(2, "0x21", 4)                     = 4
write(2, "\n", 1)                       = 1

只需獲取原始字符串,您甚至可以+在瀏覽器的 JS 控制台中將它們評估為 JavaScript 運算符,例如重新組合它們:

"SIGQUIT: quit" +
"\n" +
"PC=" +
"0x464ce1" +
" m=" +
"0" +
" sigcode=" +
"0" +
"\n" +
"\n" +
"goroutine " +
"0" +
" [" +
"idle" +
"]:\n" +
"runtime.futex" +
"(" +
")\n" +
"\t" +
"/usr/local/go/src/runtime/sys_linux_amd64.s" +
":" +
"552" +
" +" +
"0x21" +
"\n";

->

SIGQUIT: quit\nPC=0x464ce1 m=0 sigcode=0\n\ngoroutine 0 [idle]:\nruntime.futex()\n\t/usr/local/go/src/runtime/sys_linux_amd64.s:552 +0x21\n

然後用\n換行符和\t制表符替換:

SIGQUIT: quit
PC=0x464ce1 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex()
        /usr/local/go/src/runtime/sys_linux_amd64.s:552 +0x21

因此,即使在數據被發送到垃圾箱的情況下,我們也恢復了重要數據!

關鍵字: