Reading the crash dump will be useful to figure out possible reasons for a node to die a posteriori. One way to get a quick look at things is to use recon’s erl_crashdump_analyzer.sh[4] and run it on a crash dump:
------------------------------------------------------------
$ ./recon/script/erl_crashdump_analyzer.sh erl_crash.dump
analyzing erl_crash.dump, generated on: Thu Apr 17 18:34:53 2014
Slogan: eheap_alloc: Cannot allocate 2733560184 bytes of memory
(of type "old_heap").
Memory:
===
processes: 2912 Mb
processes_used: 2912 Mb
system: 8167 Mb
atom: 0 Mb
atom_used: 0 Mb
binary: 3243 Mb
code: 11 Mb
ets: 4755 Mb
---
total: 11079 Mb
Different message queue lengths (5 largest different):
===
1 5010932
2 159
5 158
49 157
4 156
Error logger queue length:
===
0
File descriptors open:
===
UDP: 0
TCP: 19951
Files: 2
---
Total: 19953
Number of processes:
===
36496
Processes Heap+Stack memory sizes (words) used in the VM (5 largest
different):
===
1 284745853
1 5157867
1 4298223
2 196650
12 121536
Processes OldHeap memory sizes (words) used in the VM (5 largest
different):
===
3 318187
9 196650
14 121536
64 75113
15 46422
Process States when crashing (sum):
===
1 Garbing
74 Scheduled
36421 Waiting
------------------------------------------------------------
This data dump won’t point out a problem directly to your face, but will be a good clue as to where to look. For example, the node here ran out of memory and had 11079 Mb out of 15 Gb used (I know this because that’s the max instance size we were using!); a quick sanity check on those figures is sketched after the list below. This can be a symptom of:
• memory fragmentation;
• memory leaks in C code or drivers;
• lots of memory that got garbage-collected before the crash dump was generated[5].
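As a sanity check, the counters in the dump’s memory section mirror the breakdown used by erlang:memory/0: total is processes plus system, and the binary, ets, code, and atom figures are all subsets of system. Working that out on the values above (in Mb):
------------------------------------------------------------
processes + system            = 2912 + 8167           = 11079  (total)
binary + ets + code + atom    = 3243 + 4755 + 11 + 0  =  8009  (part of system)
system - (binary+ets+code+atom) = 8167 - 8009          =   158  (other system memory)
------------------------------------------------------------
Note that the failed allocation was only about 2.7 Gb, which on paper would still have fit under the 15 Gb limit next to the 11079 Mb total; that mismatch is precisely why the symptoms listed above are worth suspecting.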
More generally, look for anything surprising about memory there. Correlate it with the number of processes and the size of mailboxes. One may explain the other.
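The analyzer output only gives counts of mailbox sizes, not which processes hold them. A minimal sketch to pull the five largest mailboxes along with their pids, assuming the standard per-process "Message queue length:" field of the crash dump format:
------------------------------------------------------------
$ awk -F': ' '/^=proc:/ { pid = substr($0, 7) }
              /^Message queue length: / { print $2, pid }' erl_crash.dump \
    | sort -rn | head -5
------------------------------------------------------------
The pid printed next to each count is the one worth digging into.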
In this particular dump, one process had 5 million messages in its mailbox. That’s
telling. Either it doesn’t match on all it can get, or it is getting overloaded. There are
also dozens of processes with hundreds of messages queued up — this can point towards
overload or contention. It’s hard to give general advice for a generic crash dump, but there are still a few pointers to help figure things out.
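For instance, once the mailbox listing (or the sketch above) points at a specific process, its whole section can be pulled out of the dump for a closer look. A rough sketch, where <0.245.0> is a made-up pid standing in for whichever process you are after; the range prints everything up to and including the next section header:
------------------------------------------------------------
$ sed -n '/^=proc:<0.245.0>/,/^=/p' erl_crash.dump
------------------------------------------------------------
The fields in that section (registered name, program counter, message queue length, heap sizes) usually tell you what the process was busy doing when the node died.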
[4] https://github.com/ferd/recon/blob/master/script/erl_crashdump_analyzer.sh
[5] Notably here is reference-counted binary memory, which sits in a global heap, but ends up being garbage-collected before generating the crash dump. The binary memory can therefore be underreported. See Chapter 7 for more details.