Stuff Goes Bad: Erlang in Anger


Processes are an important part of a running Erlang system. Because they’re so central to everything that goes on, there’s a lot to want to know about them. Fortunately, the VM makes a lot of information available, some of which is safe to use, and some of which is unsafe to use in production (because it can return data sets large enough that the memory copied to the shell process and used to print them can kill the node).
All the values can be obtained by calling process_info(Pid, Key) or process_info(Pid, [Keys]) [17]. Here are the commonly used keys [18]:


 1. dictionary: returns all the entries in the process dictionary [19]. Generally safe to use, because people shouldn’t be storing gigabytes of arbitrary data in there.
 2. group_leader: the group leader of a process defines where IO (files, output of io:format/1-3) goes [20].
 3. registered_name: if the process has a name (as registered with erlang:register/2), it is given here.
 4. status: the nature of the process as seen by the scheduler. The possible values are:
    a) exiting: the process is done, but not fully cleared yet;
    b) waiting: the process is waiting in a receive ... end;
    c) running: self-descriptive;
    d) runnable: ready to run, but not scheduled yet because another process is running;
    e) garbage_collecting: self-descriptive;
    f) suspended: whenever it is suspended by a BIF, or as a back-pressure mechanism because a socket or port buffer is full. The process only becomes runnable again once the port is no longer busy.
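
As a minimal sketch of the two call forms, the following expressions can be pasted into an Erlang shell. They rely only on erlang:process_info/2; the variable names (Me, Status, GL, Dict, Name) are illustrative and not part of any API.

```erlang
%% Inspect the shell's own process.
Me = self().

%% Single key: the result is one {Key, Value} tuple.
{status, Status} = erlang:process_info(Me, status).

%% List of keys: the result is a list of {Key, Value} tuples,
%% in the same order as the keys that were requested.
[{group_leader, GL}, {dictionary, Dict}] =
    erlang:process_info(Me, [group_leader, dictionary]).

%% registered_name is a special case: with the single-key form, an
%% unregistered process yields [] instead of a {Key, Value} tuple.
Name = case erlang:process_info(Me, registered_name) of
    {registered_name, Registered} -> Registered;
    [] -> undefined
end.
```

If the pid refers to a process that has already died, process_info/2 returns the atom undefined, so pattern matches like the ones above will fail on dead processes.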

links: shows a list of all the links a process has towards other processes and also ports (sockets, file descriptors). Generally safe to call, but to be used with care on large supervisors that may return thousands and thousands of entries.
monitored_by: gives a list of processes that are monitoring the current process (through the use of erlang:monitor/2).
monitors: kind of the opposite of monitored_by; it gives a list of all the processes being monitored by the one polled here.
trap_exit: has the value true if the process is trapping exits, false otherwise.
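
To make the direction of each relationship concrete, here is a small shell sketch, assuming a throwaway child process that links to and monitors the shell. The names Parent and Child are made up for the example.

```erlang
Parent = self().

%% The child is linked to the shell (spawn_link) and monitors it.
Child = spawn_link(fun() ->
    Ref = erlang:monitor(process, Parent),
    Parent ! ready,
    receive stop -> erlang:demonitor(Ref) end
end).

%% Wait until the child has set up its monitor.
receive ready -> ok end.

%% Links are symmetric: the child's links include the shell process.
{links, Links} = erlang:process_info(Child, links).

%% monitors is reported on the monitoring side ...
{monitors, Monitors} = erlang:process_info(Child, monitors).

%% ... and monitored_by on the monitored side.
{monitored_by, MonitoredBy} = erlang:process_info(Parent, monitored_by).

%% A freshly spawned process does not trap exits unless it asks to.
{trap_exit, Trapping} = erlang:process_info(Child, trap_exit).

%% Let the child finish normally.
Child ! stop.
```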

current_function: displays the current running function, as a tuple of the form {Mod, Fun, Arity}.
current_location: displays the current location within a module, as a tuple of the form {Mod, Fun, Arity, [{file, FileName}, {line, Num}]}.
current_stacktrace: a more verbose form of the preceding option; displays the current stacktrace as a list of ’current locations’.
initial_call: shows the function that the process was running when spawned, of the form {Mod, Fun, Arity}. This may help identify what the process was spawned as, rather than what it’s running right now.
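
The location keys can be tried out on a process parked in a receive. This is only a sketch: Worker is a hypothetical name, and because a fun defined in the shell is interpreted, the reported functions will likely belong to the shell's evaluator rather than to a module of your own.

```erlang
%% A process that blocks in a receive until told to stop.
Worker = spawn(fun() -> receive stop -> ok end end).

%% Give it a moment to reach the receive.
timer:sleep(50).

{current_function, Current} = erlang:process_info(Worker, current_function).

%% Each stacktrace entry has the form {Mod, Fun, Arity, Location}.
{current_stacktrace, Stack} = erlang:process_info(Worker, current_stacktrace).

{initial_call, {InitMod, InitFun, InitArity}} =
    erlang:process_info(Worker, initial_call).

Worker ! stop.
```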

Memory Used
binary: displays all the references to refc binaries [21] along with their size. Can be unsafe to use if a process has a lot of them allocated.
garbage_collection: contains information regarding garbage collection in the process. The content is documented as ’subject to change’ and should be treated as such. The information tends to contain entries such as the number of garbage collections the process has gone through, options for full-sweep garbage collections, and heap sizes.
heap_size: a typical Erlang process contains an ’old’ heap and a ’new’ heap, and goes through generational garbage collection. This entry shows the process’ heap size for the newest generation, and it usually includes the stack size. The value returned is in words.
memory: returns, in bytes, the size of the process, including the call stack, the heaps, and internal structures used by the VM that are part of a process.
message_queue_len: tells you how many messages are waiting in the mailbox of a process.
messages: returns all of the messages in a process’ mailbox. This attribute is extremely dangerous to request in production because mailboxes can hold millions of messages if you’re debugging a process that managed to get locked up. Always call for the message_queue_len first to make sure it’s safe to use.
total_heap_size: similar to heap_size, but also contains all other fragments of the heap, including the old one. The value returned is in words.
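
The advice to check message_queue_len before asking for messages can be wrapped in a small guard. The sketch below is one possible way to do it, not something provided by the VM or by recon; SafeMessages and its MaxLen threshold are hypothetical. It also shows how a value reported in words can be converted to bytes.

```erlang
%% Hypothetical helper: only copy the mailbox when it is small enough.
SafeMessages = fun(Target, MaxLen) ->
    case erlang:process_info(Target, message_queue_len) of
        {message_queue_len, Len} when Len =< MaxLen ->
            erlang:process_info(Target, messages);
        {message_queue_len, Len} ->
            {too_many_messages, Len};
        undefined ->
            undefined   % the process is no longer alive
    end
end.

%% heap_size and total_heap_size are in words; multiply by the
%% word size of the VM to get bytes.
WordSize = erlang:system_info(wordsize).
{total_heap_size, Words} = erlang:process_info(self(), total_heap_size).
HeapBytes = Words * WordSize.
```

The mailbox can still grow between the two process_info/2 calls, so this is a best-effort check rather than a guarantee.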

reductions: the Erlang VM does scheduling based on reductions, an arbitrary unit of work that allows rather portable implementations of scheduling (time-based scheduling is usually hard to make work efficiently on as many OSes as Erlang runs on). The higher the reductions, the more work, in terms of CPU and function calls, a process is doing.
Fortunately, for all the common ones that are also safe, recon contains the recon:info/1 function to help:

1> recon:info("<0.12.0>").
 For the sake of convenience, recon:info/1 will accept any pid-like first argument and handle it: literal pids, strings ("<0.12.0>"), registered atoms, global names ({global, Atom}), names registered with a third-party registry (e.g. with gproc: {via, gproc, Name}), or tuples ({0,12,0}). The process just needs to be local to the node you’re debugging.
If only a category of information is wanted, the category can be used directly:

2> recon:info(self(), work).
The function can also be used in exactly the same way as process_info/2:


3> recon:info(self(), [memory, status]).
This latter form can be used to fetch unsafe information.
With all this data, it’s possible to find out all we need to debug a system. The challenge then is often to figure out, between this per-process data and the global data, which process(es) should be targeted.
When looking for high memory usage, for example, it’s useful to be able to list all of a node’s processes and find the top N consumers. Using the attributes above and the recon:proc_count(Attribute, N) function, we can get these results:


4> recon:proc_count(memory, 3).
 Any of the attributes mentioned earlier can work, and for nodes with long-lived processes that can cause problems, it’s a fairly useful function.
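
To illustrate the idea of ranking processes by an attribute, here is a rough equivalent written with only the standard library. It is not recon's implementation and its result shape is simpler than what recon:proc_count/2 returns; Top is a hypothetical name, and the attribute is assumed to have a numeric value (memory, reductions, message_queue_len, and so on).

```erlang
%% Hypothetical helper: the N processes with the highest value for Attr.
Top = fun(Attr, N) ->
    %% Processes that died since processes/0 was called return
    %% undefined, which fails the {_, Val} pattern and is skipped.
    Pairs = [{Pid, Val} || Pid <- erlang:processes(),
                           {_, Val} <- [erlang:process_info(Pid, Attr)]],
    %% Sort in descending order of the attribute value.
    Sorted = lists:sort(fun({_, A}, {_, B}) -> A >= B end, Pairs),
    lists:sublist(Sorted, N)
end.

Top(memory, 3).
```

The call returns a list of {Pid, Value} tuples, largest first. Like any full scan of the process table, it touches every process on the node, so it is best used sparingly on very large systems.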
 There is however a problem when most processes are short-lived, usually too short to inspect through other tools, or when a moving window is what we need (for example, what processes are busy accumulating memory or running code right now).
For this use case, Recon has the recon:proc_window(Attribute, Num, Milliseconds) function.
It is important to see this function as a snapshot over a sliding window. A program’s timeline during sampling might look like this:


--w---- [Sample1] ---x-------------y----- [Sample2] ---z--->

 The function will take two samples at an interval defined by Milliseconds.
 Some processes will live from w and die at x, some from y to z, and some from x to y. These samples will not be too significant as they’re incomplete.
If the majority of your processes run between a time interval x to y (in absolute terms), you should make sure that your sampling time is smaller than this so that for many processes, their lifetime spans the equivalent of w and z. Not doing this can skew the results: long-lived processes that have 10 times the time to accumulate data (say reductions) will look like huge consumers when they’re not [22]. Once run, the function gives results like the following:

5> recon:proc_window(reductions, 3, 500).
 With these two functions, it becomes possible to home in on a specific process that is causing issues or misbehaving.
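
The two-sample approach can be sketched with the standard library alone. This is a simplified illustration, not recon's code: Window is a hypothetical name, it assumes a numeric attribute, and it only keeps processes that appear in both samples, which may differ from how recon:proc_window/3 treats processes seen in just one of them.

```erlang
%% Hypothetical helper: the N processes whose Attr grew the most
%% over an interval of Millis milliseconds.
Window = fun(Attr, N, Millis) ->
    %% Take one snapshot of Attr for every live process.
    Sample = fun() ->
        maps:from_list([{P, V} || P <- erlang:processes(),
                                  {_, V} <- [erlang:process_info(P, Attr)]])
    end,
    First = Sample(),
    timer:sleep(Millis),
    Second = Sample(),
    %% Keep processes present in both samples and compute the change.
    Deltas = maps:fold(fun(P, V2, Acc) ->
        case maps:find(P, First) of
            {ok, V1} -> [{P, V2 - V1} | Acc];
            error -> Acc
        end
    end, [], Second),
    Sorted = lists:sort(fun({_, A}, {_, B}) -> A >= B end, Deltas),
    lists:sublist(Sorted, N)
end.

Window(reductions, 3, 500).
```

Both snapshots are held in memory at the same time, which is the cost that footnote 22 warns about on nodes with many tens of thousands of processes.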


[17] In cases where processes contain sensitive information, data can be forced to be kept private by calling process_flag(sensitive, true)
[18] For all options, look at
[19] See and
[20] See and for more details.
[21] See Section 7.2
[22] Warning: this function depends on data gathered at two snapshots, and then building a dictionary with entries to differentiate them. This can take a heavy toll on memory when you have many tens of thousands of processes, and a little bit of time.
