Reading the crash dump will be useful to figure out possible reasons for a node to die a posteriori. One way to get a quick look at things is to use recon’s erl_crashdump_analyzer.sh[4] and run it on a crash dump:
------------------------------------------------------------
$ ./recon/script/erl_crashdump_analyzer.sh erl_crash.dump
analyzing erl_crash.dump, generated on: Thu Apr 17 18:34:53 2014
Slogan: eheap_alloc: Cannot allocate 2733560184 bytes of memory
(of type "old_heap").
Memory:
===
processes: 2912 Mb
processes_used: 2912 Mb
system: 8167 Mb
atom: 0 Mb
atom_used: 0 Mb
binary: 3243 Mb
code: 11 Mb
ets: 4755 Mb
---
total: 11079 Mb
Different message queue lengths (5 largest different):
===
1 5010932
2 159
5 158
49 157
4 156
Error logger queue length:
===
0
File descriptors open:
===
UDP: 0
TCP: 19951
Files: 2
---
Total: 19953
Number of processes:
===
36496
Processes Heap+Stack memory sizes (words) used in the VM (5 largest
different):
===
1 284745853
1 5157867
1 4298223
2 196650
12 121536
Processes OldHeap memory sizes (words) used in the VM (5 largest
different):
===
3 318187
9 196650
14 121536
64 75113
15 46422
Process States when crashing (sum):
===
1 Garbing
74 Scheduled
36421 Waiting
------------------------------------------------------------
This data dump won’t point out a problem directly to your face, but will be a good clue as to where to look. For example, the node here ran out of memory and had 11079 Mb out of 15 Gb used (I know this because that’s the max instance size we were using!); a quick sanity check on those figures is sketched after the list below. This can be a symptom of:
• memory fragmentation;
• memory leaks in C code or drivers;
• lots of memory that got garbage-collected before the crash dump was generated[5].
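As a sanity check, the counters in the dump’s memory section mirror the breakdown used by erlang:memory/0: total is processes plus system, and the binary, ets, code, and atom figures are all subsets of system. Working that out on the values above (in Mb):
------------------------------------------------------------
processes + system            = 2912 + 8167           = 11079  (total)
binary + ets + code + atom    = 3243 + 4755 + 11 + 0  =  8009  (part of system)
system - (binary+ets+code+atom) = 8167 - 8009          =   158  (other system memory)
------------------------------------------------------------
Note that the failed allocation was only about 2.7 Gb, which on paper would still have fit under the 15 Gb limit next to the 11079 Mb total; that mismatch is precisely why the symptoms listed above are worth suspecting.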
More generally, look for anything surprising about memory there. Correlate it with the number of processes and the size of mailboxes. One may explain the other.
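The analyzer output only gives counts of mailbox sizes, not which processes hold them. A minimal sketch to pull the five largest mailboxes along with their pids, assuming the standard per-process "Message queue length:" field of the crash dump format:
------------------------------------------------------------
$ awk -F': ' '/^=proc:/ { pid = substr($0, 7) }
              /^Message queue length: / { print $2, pid }' erl_crash.dump \
    | sort -rn | head -5
------------------------------------------------------------
The pid printed next to each count is the one worth digging into.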
In this particular dump, one process had 5 million messages in its mailbox. That’s
telling. Either it doesn’t match on all it can get, or it is getting overloaded. There are
also dozens of processes with hundreds of messages queued up — this can point towards
overload or contention. It’s hard to give general advice for a generic crash dump, but there are still a few pointers to help figure things out.
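For instance, once the mailbox listing (or the sketch above) points at a specific process, its whole section can be pulled out of the dump for a closer look. A rough sketch, where <0.245.0> is a made-up pid standing in for whichever process you are after; the range prints everything up to and including the next section header:
------------------------------------------------------------
$ sed -n '/^=proc:<0.245.0>/,/^=/p' erl_crash.dump
------------------------------------------------------------
The fields in that section (registered name, program counter, message queue length, heap sizes) usually tell you what the process was busy doing when the node died.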
[4] https://github.com/ferd/recon/blob/master/script/erl_crashdump_analyzer.sh
[5] Notably here is reference-counted binary memory, which sits in a global heap, but ends up being garbage-collected before generating the crash dump. The binary memory can therefore be underreported. See Chapter 7 for more details.