Stuff Goes Bad: Erlang In Anger

Full Mailboxes

For loaded mailboxes, the message queue length counters are the best place to look. If a single mailbox is large, investigate that process in the crash dump and figure out whether it is failing to match on some messages or is simply overloaded. If you have a similar node still running, you can log on to it and inspect the same process there. If many mailboxes are loaded, you may want to use recon's queue_fun.awk to figure out what function each process was running at the time of the crash:

-------------------------------------------------------------------
$ awk -v threshold=10000 -f queue_fun.awk /path/to/erl_crash.dump
MESSAGE QUEUE LENGTH: CURRENT FUNCTION
======================================
10641: io:wait_io_mon_reply/2
12646: io:wait_io_mon_reply/2
32991: io:wait_io_mon_reply/2
2183837: io:wait_io_mon_reply/2
730790: io:wait_io_mon_reply/2
80194: io:wait_io_mon_reply/2
...
-------------------------------------------------------------------
This script runs over the crash dump and outputs the function scheduled to run for every process with at least 10000 messages in its mailbox. In this run, for example, the output showed that the entire node was locking up waiting on IO for io:format/2 calls.
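
For the case where a similar node is still running, the same kind of listing can be produced from a shell on that node. The following is only a minimal sketch: the module name mailbox_check and the function large_mailboxes/1 are made up for illustration and are not part of recon. It relies only on erlang:processes/0 and erlang:process_info/2.

```erlang
-module(mailbox_check).
-export([large_mailboxes/1]).

%% Hypothetical helper (not part of recon): returns a list of
%% {Pid, QueueLen, CurrentFunction} for every local process whose
%% mailbox holds at least Threshold messages, largest mailbox first.
large_mailboxes(Threshold) when is_integer(Threshold), Threshold >= 0 ->
    Rows = [{Pid, Len, Fun}
            || Pid <- erlang:processes(),
               %% process_info/2 returns 'undefined' for a process that
               %% died in the meantime; the pattern below skips those.
               [{message_queue_len, Len}, {current_function, Fun}]
                   <- [erlang:process_info(Pid, [message_queue_len,
                                                 current_function])],
               Len >= Threshold],
    %% Sort by queue length, descending.
    lists:sort(fun({_, A, _}, {_, B, _}) -> A >= B end, Rows).
```

Calling mailbox_check:large_mailboxes(10000) on the live node gives roughly the same view as the awk script gives for a dump. If recon is already loaded on the node, recon:proc_count(message_queue_len, N) should give a comparable top-N list without any custom code.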