By all means, processes are an important part of a running Erlang system. And because they’re so central to everything that goes on, there’s a lot to want to know about them. Fortunately, the VM makes a lot of information available, some of which is safe to use, and some of which is unsafe to use in production (because it can return data sets large enough that the amount of memory copied to the shell process and used to print them can kill the node).
Meta
1. dictionary returns all the entries in the process dictionary [19]. Generally safe to use, because people shouldn’t be storing gigabytes of arbitrary data in there.
2. group_leader the group leader of a process defines where IO (files, output of io:format/1-3) goes [20].
3. registered_name if the process has a name (as registered with erlang:register/2), it is given here.
4. status the nature of the process as seen by the scheduler. The possible values are:
a). exiting the process is done, but not fully cleared yet;
b). waiting the process is waiting in a receive ... end;
c). running self-descriptive;
d). runnable ready to run, but not scheduled yet because another process is running;
e). garbage_collecting self-descriptive;
f). suspended whenever it is suspended by a BIF, or as a back-pressure mechanism
because a socket or port buffer is full. The process only becomes runnable
again once the port is no longer busy.
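As a minimal sketch, the Meta attributes above can be fetched in one call to erlang:process_info/2. The module name proc_meta and the function meta/1 are made up for this illustration and are not part of OTP or recon.

```erlang
-module(proc_meta).
-export([meta/1]).

%% Hypothetical helper: fetch the "Meta" attributes for a local process.
%% process_info/2 returns 'undefined' if the process is no longer alive,
%% and {registered_name, []} when the process has no registered name.
meta(Pid) when is_pid(Pid) ->
    Keys = [registered_name, dictionary, group_leader, status],
    erlang:process_info(Pid, Keys).
```

Calling proc_meta:meta(self()) from the shell should report a status of running, since the process is busy executing the call that inspects it.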
Signals
links will show a list of all the links a process has towards other processes and also
ports (sockets, file descriptors). Generally safe to call, but to be used with care
on large supervisors that may return thousands and thousands of entries.
monitored_by gives a list of processes that are monitoring the current process (through
the use of erlang:monitor/2).
monitors kind of the opposite of monitored_by; it gives a list of all the processes
being monitored by the one polled here.
trap_exit has the value true if the process is trapping exits, false otherwise.
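The following sketch shows how these attributes relate to each other for a linked and monitored child process. The module proc_signals and its demo/0 function are hypothetical names used only for this example.

```erlang
-module(proc_signals).
-export([demo/0]).

%% Hypothetical demo: link to and monitor a child, then read the
%% signal-related attributes from both sides of the relationship.
demo() ->
    Parent = self(),
    Child = spawn_link(fun() -> receive stop -> ok end end),
    Ref = erlang:monitor(process, Child),
    {links, Links} = erlang:process_info(Parent, links),
    {monitors, Monitors} = erlang:process_info(Parent, monitors),
    {monitored_by, MonBy} = erlang:process_info(Child, monitored_by),
    {trap_exit, Trap} = erlang:process_info(Parent, trap_exit),
    %% Clean up so the demo leaves nothing behind.
    erlang:demonitor(Ref, [flush]),
    unlink(Child),
    Child ! stop,
    #{child => Child, links => Links, monitors => Monitors,
      monitored_by => MonBy, trap_exit => Trap}.
```

The child should appear in the parent's links and, as {process, Child}, in its monitors, while the parent should appear in the child's monitored_by list.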
Memory Used
binary Displays all the references to refc binaries [21] along with their size. Can be unsafe to use if a process has a lot of them allocated.
garbage_collection contains information regarding garbage collection in the process. The content is documented as ’subject to change’ and should be treated as
such. The information tends to contain entries such as the number of garbage
collections the process has gone through, options for full-sweep garbage collections, and heap sizes.
heap_size A typical Erlang process contains an ’old’ heap and a ’new’ heap, and
goes through generational garbage collection. This entry shows the process’
heap size for the newest generation, and it usually includes the stack size. The
value returned is in words.
memory Returns, in bytes, the size of the process, including the call stack, the heaps,
and internal structures used by the VM that are part of a process.
message_queue_len Tells you how many messages are waiting in the mailbox of a
process.
messages Returns all of the messages in a process’ mailbox. This attribute is extremely dangerous to request in production because mailboxes can hold millions
of messages if you’re debugging a process that managed to get locked up. Always
call for the message_queue_len first to make sure it’s safe to use.
total_heap_size Similar to heap_size, but also contains all other fragments of the
heap, including the old one. The value returned is in words.
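Two of the points above lend themselves to a short sketch: converting the word-based heap figures to bytes, and checking message_queue_len before asking for messages. The module proc_mem, its function names, and the Max threshold are assumptions made for this example. The check and the fetch are two separate calls, so the mailbox can still grow in between.

```erlang
-module(proc_mem).
-export([heap_bytes/1, safe_messages/2]).

%% Hypothetical helper: convert heap sizes (in words) to bytes using the
%% VM's word size. Results come back in the order the keys were requested.
heap_bytes(Pid) ->
    WordSize = erlang:system_info(wordsize),
    case erlang:process_info(Pid, [heap_size, total_heap_size, memory]) of
        undefined ->
            undefined;
        [{heap_size, H}, {total_heap_size, T}, {memory, M}] ->
            [{heap_bytes, H * WordSize},
             {total_heap_bytes, T * WordSize},
             {memory, M}]
    end.

%% Hypothetical helper: only fetch the mailbox contents when the queue
%% holds at most Max messages; otherwise report its length.
safe_messages(Pid, Max) ->
    case erlang:process_info(Pid, message_queue_len) of
        undefined ->
            undefined;
        {message_queue_len, Len} when Len =< Max ->
            erlang:process_info(Pid, messages);
        {message_queue_len, Len} ->
            {too_many, Len}
    end.
```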
Work
reductions The Erlang VM does scheduling based on reductions, an arbitrary unit
of work that allows rather portable implementations of scheduling (time-based
scheduling is usually hard to make work efficiently on as many OSes as Erlang
runs on). The higher the reductions, the more work, in terms of CPU and
function calls, a process is doing.
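Because the reductions count only ever grows, a single reading says little about what a process is doing right now. A rough sketch, assuming a hypothetical module proc_work, is to read the counter twice and look at the difference:

```erlang
-module(proc_work).
-export([reductions_delta/2]).

%% Hypothetical helper: how many reductions did Pid consume over roughly
%% Ms milliseconds? Returns 'undefined' if the process dies in between.
reductions_delta(Pid, Ms) ->
    case erlang:process_info(Pid, reductions) of
        undefined ->
            undefined;
        {reductions, R1} ->
            timer:sleep(Ms),
            case erlang:process_info(Pid, reductions) of
                undefined -> undefined;
                {reductions, R2} -> R2 - R1
            end
    end.
```

A process stuck in a tight loop should show a large delta, while one blocked in a receive should show a delta at or near zero.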
Fortunately, for all the common ones that are also safe, recon contains the recon:info/1
function to help:
-------------------------------------------------------------------
1> recon:info("<0.12.0>").
[{meta,[{registered_name,rex},
        {dictionary,[{'$ancestors',[kernel_sup,<0.10.0>]},
                     {'$initial_call',{rpc,init,1}}]},
        {group_leader,<0.9.0>},
        {status,waiting}]},
 {signals,[{links,[<0.11.0>]},
           {monitors,[]},
           {monitored_by,[]},
           {trap_exit,true}]},
 {location,[{initial_call,{proc_lib,init_p,5}},
            {current_stacktrace,[{gen_server,loop,6,
                                  [{file,"gen_server.erl"},{line,358}]},
                                 {proc_lib,init_p_do_apply,3,
                                  [{file,"proc_lib.erl"},{line,239}]}]}]},
 {memory_used,[{memory,2808},
               {message_queue_len,0},
               {heap_size,233},
               {total_heap_size,233},
               {garbage_collection,[{min_bin_vheap_size,46422},
                                    {min_heap_size,233},
                                    {fullsweep_after,65535},
                                    {minor_gcs,0}]}]},
 {work,[{reductions,35}]}]
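-------------------------------------------------------------------
If only one category of information is wanted, the category can be passed as a second argument:
-------------------------------------------------------------------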
2> recon:info(self(), work).
{work,[{reductions,11035}]}
-------------------------------------------------------------------
recon:info/2 can also be used in exactly the same way as process_info/2:
-------------------------------------------------------------------
3> recon:info(self(), [memory, status]).
[{memory,10600},{status,running}]
-------------------------------------------------------------------
This latter form can be used to fetch unsafe information.
With all this data, it’s possible to find out all we need to debug a system. The challenge then is often to figure out, between this per-process data and the global data, which
process(es) should be targeted.
When looking for high memory usage, for example, it’s interesting to be able to list all
of a node’s processes and find the top N consumers. Using the attributes above and the
recon:proc_count(Attribute, N) function, we can get these results:
-------------------------------------------------------------------
4> recon:proc_count(memory, 3).
[{<0.26.0>,831448,
  [{current_function,{group,server_loop,3}},
   {initial_call,{group,server,3}}]},
 {<0.25.0>,372440,
  [user,
   {current_function,{group,server_loop,3}},
   {initial_call,{group,server,3}}]},
 {<0.20.0>,372312,
  [code_server,
   {current_function,{code_server,loop,1}},
   {initial_call,{erlang,apply,2}}]}]
-------------------------------------------------------------------
Any of the attributes mentioned earlier can work, and for nodes with long-lived processes
that can cause problems, it’s a fairly useful function.
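To make the idea concrete, here is a rough sketch of a top-N listing built on process_info/2 alone. It is not recon's implementation; the module proc_top is a hypothetical name, and the sketch only handles attributes whose value is an integer (memory, reductions, message_queue_len, heap_size, and so on).

```erlang
-module(proc_top).
-export([top/2]).

%% Hypothetical helper: list the N processes with the highest value for
%% an integer-valued attribute. Processes that die while being polled
%% return 'undefined' from process_info/2 and are skipped by the pattern.
top(Attr, N) when is_atom(Attr), is_integer(N), N >= 0 ->
    Pairs = [{Pid, Val} || Pid <- erlang:processes(),
                           {_, Val} <- [erlang:process_info(Pid, Attr)],
                           is_integer(Val)],
    Sorted = lists:reverse(lists:keysort(2, Pairs)),
    lists:sublist(Sorted, N).
```

For example, proc_top:top(memory, 3) returns up to three {Pid, Bytes} pairs in descending order of memory use.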
There is however a problem when most processes are short-lived, usually too short to
inspect through other tools, or when a moving window is what we need (for example, what
processes are busy accumulating memory or running code right now).
For this use case, Recon has the recon:proc_window(Attribute, Num, Milliseconds)
function.
It is important to see this function as a snapshot over a sliding window. A program’s
timeline during sampling might look like this:
--w---- [Sample1] ---x-------------y----- [Sample2] ---z--->
The function will take two samples at an interval defined by Milliseconds.
Some processes will be alive at w and die at x, some will live between y and z, and some
between x and y. These samples will not be too significant as they’re incomplete.
If the majority of your processes run between a time interval x to y (in absolute terms),
you should make sure that your sampling time is smaller than this so that for many processes, their lifetime spans the equivalent of w and z. Not doing this can skew the results:
long-lived processes that have 10 times the time to accumulate data (say reductions) will
look like huge consumers when they’re not. [22]
Once run, the function gives results like the following:
-------------------------------------------------------------------
5> recon:proc_window(reductions, 3, 500).
[{<0.46.0>,51728,
  [{current_function,{queue,in,2}},
   {initial_call,{erlang,apply,2}}]},
 {<0.49.0>,5728,
  [{current_function,{dict,new,0}},
   {initial_call,{erlang,apply,2}}]},
 {<0.43.0>,650,
  [{current_function,{timer,sleep,1}},
   {initial_call,{erlang,apply,2}}]}]
-------------------------------------------------------------------
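The two-sample approach described above can be sketched as follows. This is a simplified illustration and not recon's actual code: the module proc_win is a hypothetical name, and as a simplifying assumption the sketch drops any process that is not present in both samples.

```erlang
-module(proc_win).
-export([window/3]).

%% Hypothetical helper: take two samples of an integer-valued attribute,
%% Ms milliseconds apart, and return the N processes whose value grew the
%% most. Only processes present in both samples are compared.
window(Attr, N, Ms) ->
    First = maps:from_list(sample(Attr)),
    timer:sleep(Ms),
    Deltas = [{Pid, V2 - V1} || {Pid, V2} <- sample(Attr),
                                {ok, V1} <- [maps:find(Pid, First)]],
    lists:sublist(lists:reverse(lists:keysort(2, Deltas)), N).

%% One snapshot of {Pid, Value} pairs for every live process.
sample(Attr) ->
    [{Pid, V} || Pid <- erlang:processes(),
                 {_, V} <- [erlang:process_info(Pid, Attr)],
                 is_integer(V)].
```

Holding the first sample in a map is what makes the comparison possible, and it is also the source of the memory cost on nodes with very many processes.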
With these two functions, it becomes possible to home in on a specific process that is
causing issues or misbehaving.
[17] In cases where processes contain sensitive information, data can be forced to be kept private by calling process_flag(sensitive, true).
[18] For all options, look at http://www.erlang.org/doc/man/erlang.html#process_info-2
[19] See http://www.erlang.org/course/advanced.html#dict and http://ferd.ca/on-the-use-of-the-process-dictionary-in-erlang.html
[20] See http://learnyousomeerlang.com/building-otp-applications#the-application-behaviour and http://erlang.org/doc/apps/stdlib/io_protocol.html for more details.
[21] See Section 7.2
[22] Warning: this function depends on data gathered at two snapshots, and then building a dictionary with
entries to differentiate them. This can take a heavy toll on memory when you have many tens of thousands
of processes, and a little bit of time.