Stuff Goes Bad: Erlang in Anger

General View

Reading the crash dump is useful for figuring out, a posteriori, possible reasons for a node to die. One way to get a quick look at things is to use recon’s erl_crashdump_analyzer.sh script and run it on a crash dump:

$ ./recon/script/erl_crashdump_analyzer.sh erl_crash.dump
analyzing erl_crash.dump, generated on: Thu Apr 17 18:34:53 2014
Slogan: eheap_alloc: Cannot allocate 2733560184 bytes of memory
(of type "old_heap").
processes: 2912 Mb
processes_used: 2912 Mb
system: 8167 Mb
atom: 0 Mb
atom_used: 0 Mb
binary: 3243 Mb
code: 11 Mb
ets: 4755 Mb
total: 11079 Mb
Different message queue lengths (5 largest different):
1 5010932
2 159
5 158
49 157
4 156
Error logger queue length:
File descriptors open:
UDP: 0
TCP: 19951
Files: 2
Total: 19953
Number of processes:
Processes Heap+Stack memory sizes (words) used in the VM (5 largest different):
1 284745853
1 5157867
1 4298223
2 196650
12 121536
Processes OldHeap memory sizes (words) used in the VM (5 largest different):
3 318187
9 196650
14 121536
64 75113
15 46422
Process States when crashing (sum):
1 Garbing
74 Scheduled
36421 Waiting
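As a sketch of what the script tallies under the hood: each `=proc:<pid>` section of a raw erl_crash.dump carries a `Message queue length:` field, which standard Unix tools can aggregate. The miniature dump below is fabricated for illustration; only the field names match the real format.

```shell
# Fabricated miniature crash dump so the commands below run as-is;
# real dumps carry one "Message queue length:" line per =proc: entry.
cat > /tmp/sample_crash.dump <<'EOF'
=proc:<0.4.0>
State: Waiting
Message queue length: 5010932
=proc:<0.5.0>
State: Waiting
Message queue length: 159
=proc:<0.6.0>
State: Waiting
Message queue length: 159
EOF

# Count identical queue lengths and show the largest, analyzer-style:
awk -F': ' '/^Message queue length/ {print $2}' /tmp/sample_crash.dump \
  | sort -rn | uniq -c | sort -k2 -rn | head -5
```

This prints one `count value` pair per distinct queue length, largest values first, which is the same shape as the "Different message queue lengths" block above.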
This data dump won’t point out a problem directly to your face, but it will be a good clue as to where to look. For example, the node here ran out of memory with 11079 Mb used out of 15 Gb (I know this because that’s the maximum instance size we were using!). This can be a symptom of:
 • memory fragmentation;
 • memory leaks in C code or drivers;
 • lots of memory that got garbage-collected before the crash dump was generated [5].
More generally, look for anything surprising about memory in there. Correlate it with the number of processes and the size of mailboxes: one may explain the other.
In this particular dump, one process had 5 million messages in its mailbox. That’s telling. Either it doesn’t match on all the messages it can receive, or it is getting overloaded. There are also dozens of processes with hundreds of messages queued up; this can point towards overload or contention. It’s hard to give general advice for a generic crash dump, but there are still a few pointers to help figure things out.
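When one mailbox dominates like this, the next step is usually to find which process owned it. A minimal sketch with standard tools, assuming the usual crash-dump layout (an `=proc:<pid>` header followed by that process’s `Message queue length:` field; the sample dump here is fabricated):

```shell
# Fabricated miniature dump; real erl_crash.dump files use the same layout.
cat > /tmp/mq_crash.dump <<'EOF'
=proc:<0.4.0>
Message queue length: 5010932
=proc:<0.5.0>
Message queue length: 159
=proc:<0.6.0>
Message queue length: 158
EOF

# Pair each queue length with the pid of its =proc: section, largest first,
# so the offender's full entry can then be inspected in the dump.
awk '/^=proc:/ {pid = substr($0, 7)}
     /^Message queue length:/ {print $4, pid}' /tmp/mq_crash.dump \
  | sort -rn | head -3
```

With the pid in hand, searching the dump for that `=proc:` section shows its registered name, current function, and stack, which usually tells you whether the process failed to match messages or was simply overloaded.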

[5] Notable here is reference-counted binary memory, which sits in a global heap but ends up being garbage-collected before the crash dump is generated. Binary memory can therefore be underreported. See Chapter 7 for more details.
